Credit Score using Machine Learning

Published in

DataDrivenInvestor

6 min readMar 2, 2020

The goal is to use machine learning to create a credit score for customers. This score gives the degree of confidence that the customer will meet the agreed payments.

The higher the score, define the greater the probability of non-payment.

Multiple Linear Regression in Python with Scikit-Learn

We just performed linear regression in the above section involving two variables. Almost all the real-world problems that you are going to encounter will have more than two variables.

Linear regression involving multiple variables is called “multiple linear regression” or multivariate linear regression. The steps to perform multiple linear regression are almost similar to that of simple linear regression.

Machine Learning in Finance | Data Driven Investor

Before we cover some Machine Learning finance applications, let's first understand what Machine Learning is. Machine…

www.datadriveninvestor.com

We will use customer information to generate a ‘trust’ score on the customer. The scoring formula can be adapted for each company according to its credit context. In this example we are going to use the average number of days the customer is late, and the average billing amount for the past 2 years to calculate a score that combines the 2 information.

After calculating the score, we submit the information to a machine learning with Scikit-Learn, so that the system can predict new scores based on the learning information.

Our formula for Score calculation described on Score calculation.xlsx

Customer information is in the excel: Customers_CODE.XLSX

Customer company information:

Customer from date
State, Region, Postcode, Salesman, Main CNAE (type of company classification in Brazil)
Highest Billing Date
Maximum billing amount
Last Date invoice issued
Largest credit exposure date
Highest credit exposure
Average historical delay
Average revenue last 48 months
Amount payable Overdue
Amount payable due
Customer Last Order Date
Date of this information

SERASA Information ( Serasa it’s a company that sell’s information about other companies)

Serasa Score
Probability of not paying
Last date non payment
Amount of unpaid documents
Value of unpaid documents
Last date bad checks
Amount of bad checks
Last date protests
Value protests
Last date judicial actions
Value judicial actions
Last date overdue debts
Value overdue debts

Some Python code to obtain data and clean data:

Read excel file:

Watching the read information:

Analysis of null information columns:

Another form could be that, that present you only de null columns:

To obtain better data, we filter the data with customer with one least one invoice in the last 2 years:

Rename some columns to better identification:

To columns with date information, we transform data into days and transform column type to Integer. The column Data_ult_atualizacao has today date.

The columns with Postal Code and CNAE was converted to a number and the columns changed to type integer:

Replacing NaN columns with 0 (zero) :

Creating a partial dataset with valuables columns, to the Train:

Analyze the copied data set:

Transform Days columns to float:

After we create a routine to read all records and calculate my score (newscore) using my definitions. I have an excel file with my definitions to calculate my score (file: Score calculation.xlsx)

Some prints of the calculation score:

Analysis of the correlation, we can watch the columns with more correlation with them to obtain the new score:

We can watch them in a graphic base:

Some facts about our information, customers by region. Most customers are from the same region.

Data information by CNAE. Most customers are from the same CNAE.

Customers by vendor:

Customer information by transport type of transport that is used for customer deliveries. We have the most of then with the type started by ‘FO’.

Relationship between the new score and due days:

We need to transform Region in a number:

Now we can model with Keras:

Create Train and Test datasets, 80% for Train and 20% to Test

Training the model:

Obtain the intercept and slope values:

These values could be printed in a format with better information:

Comparing Train with Test:

In graphic mode, we can compare Train and Test results:

I need to predict some individual records, so, I made a function to predict score:

Running to predict my record:

Using tkinter python library, we can create a screen with better visual:

The creation of a score using the information known to a customer, can automate and make a credit system more reliable. This approach contributes to cooler and more reliable risk analysis, resulting only from market data and information, removing from the system the criteria of proximity to the client and the emotions that negotiation can generate.

References:

MRobalinho/ML-Credit_Score

Using machine learning to create a credit score to customers We just performed linear regression in the above section…

github.com

(Tutorial) Introduction to GUI With TKINTER in PYTHON

Before you begin, you should be familiar with Python to learn Tkinter. If you're new to Python, check out DataCamp's…

www.datacamp.com

Python GUI examples (Tkinter Tutorial) - Like Geeks

Learn how to develop GUI applications using Python Tkinter package, In this tutorial, you'll learn how to create…

likegeeks.com

Python Machine Learning Multiple Regression

Multiple regression is like linear regression, but with more than one independent value, meaning that we try to predict…

www.w3schools.com

Simple and multiple linear regression with Python

Linear regression is a linear approach to model the relationship between a dependent variable (target variable) and one…

towardsdatascience.com