Credit Score using Machine Learning

Manuel Silva Robalinho
DataDrivenInvestor
Published in
6 min readMar 2, 2020

--

Score to customer credit system

The goal is to use machine learning to create a credit score for customers. This score gives the degree of confidence that the customer will meet the agreed payments.

The higher the score, define the greater the probability of non-payment.

Multiple Linear Regression in Python with Scikit-Learn

We just performed linear regression in the above section involving two variables. Almost all the real-world problems that you are going to encounter will have more than two variables.

Linear regression involving multiple variables is called “multiple linear regression” or multivariate linear regression. The steps to perform multiple linear regression are almost similar to that of simple linear regression.

We will use customer information to generate a ‘trust’ score on the customer. The scoring formula can be adapted for each company according to its credit context. In this example we are going to use the average number of days the customer is late, and the average billing amount for the past 2 years to calculate a score that combines the 2 information.

After calculating the score, we submit the information to a machine learning with Scikit-Learn, so that the system can predict new scores based on the learning information.

Our formula for Score calculation described on Score calculation.xlsx

Customer information is in the excel: Customers_CODE.XLSX

Customer company information:

Customer from date

State, Region, Postcode, Salesman, Main CNAE (type of company classification in Brazil)

Highest Billing Date

Maximum billing amount

Last Date invoice issued

Largest credit exposure date

Highest credit exposure

Average historical delay

Average revenue last 48 months

Amount payable Overdue

Amount payable due

Customer Last Order Date

Date of this information

SERASA Information ( Serasa it’s a company that sell’s information about other companies)

Serasa Score

Probability of not paying

Last date non payment

Amount of unpaid documents

Value of unpaid documents

Last date bad checks

Amount of bad checks

Last date protests

Value protests

Last date judicial actions

Value judicial actions

Last date overdue debts

Value overdue debts

Some Python code to obtain data and clean data:

Read excel file:

Watching the read information:

Analysis of null information columns:

Another form could be that, that present you only de null columns:

To obtain better data, we filter the data with customer with one least one invoice in the last 2 years:

Rename some columns to better identification:

To columns with date information, we transform data into days and transform column type to Integer. The column Data_ult_atualizacao has today date.

The columns with Postal Code and CNAE was converted to a number and the columns changed to type integer:

Replacing NaN columns with 0 (zero) :

Creating a partial dataset with valuables columns, to the Train:

Analyze the copied data set:

Transform Days columns to float:

After we create a routine to read all records and calculate my score (newscore) using my definitions. I have an excel file with my definitions to calculate my score (file: Score calculation.xlsx)

Some prints of the calculation score:

Analysis of the correlation, we can watch the columns with more correlation with them to obtain the new score:

We can watch them in a graphic base:

Some facts about our information, customers by region. Most customers are from the same region.

Data information by CNAE. Most customers are from the same CNAE.

Customers by vendor:

Customer information by transport type of transport that is used for customer deliveries. We have the most of then with the type started by ‘FO’.

Relationship between the new score and due days:

We need to transform Region in a number:

Now we can model with Keras:

Create Train and Test datasets, 80% for Train and 20% to Test

Training the model:

Obtain the intercept and slope values:

These values could be printed in a format with better information:

Comparing Train with Test:

In graphic mode, we can compare Train and Test results:

I need to predict some individual records, so, I made a function to predict score:

Running to predict my record:

Using tkinter python library, we can create a screen with better visual:

The creation of a score using the information known to a customer, can automate and make a credit system more reliable. This approach contributes to cooler and more reliable risk analysis, resulting only from market data and information, removing from the system the criteria of proximity to the client and the emotions that negotiation can generate.

References:

--

--

Master's degree in Informatics. Interest in AI, Machine Learning and Deep Learning. More about me on http://mrobalinho.blogspot.com.br/