Build your data science project step by step with Python

Why is data science important nowadays?

narjes karmeni
DataDrivenInvestor


In order to use their data efficiently, companies are moving from traditional analytics tools that provide descriptive analytics to visualize "what is happening" toward predictive and prescriptive analytics with data science tools.
Predictive analytics helps identify the factors impacting a target event and answers "why it happens", and prescriptive analytics then uses statistical modeling to anticipate "what will happen in the future".
I wrote this story to help professionals (software developers, accountants, …), data science enthusiasts and anyone intending to make a career transition into data science.

This article clarifies the key steps to build a simple predictive model and translate data into business recommendations.
Data science is used in strategic marketing, supply chain management, risk analysis, client churn, cost reduction, failure detection, etc., as well as to predict future sales and future expenses.
First, let's answer a frequently asked question: what is the difference between a data engineer and a data scientist?

Data Team Roles :

Data Scientist role : focuses on using statistical methods to build models able to answer business questions.
The main focus is to achieve a highly accurate predictive model that can also be interpreted by business specialists.

Data Engineer role : cares about putting the project into a production environment and about processing performance such as memory usage and processing latency. He develops a maintainable and normalized code base. The main focus is to build an architecture to manage the data lifecycle (ingest, collect, store & transform) and to define the technical environment that enables optimized and automated processing.

Business expert : yet another mandatory role in a data team. He is a specialist of a specific domain (the domain of the project), such as risk analysis, marketing, sales, customer service, etc. He helps the data scientist in modeling.

Step A : Defining requirements & modeling

In a data model, there are two principal variable types that should be defined:

Explicative variables, usually noted (X), refer to the set of input data used to predict an explained variable (y) with a predictive model. The explained variable is also called the target variable or dependent variable.
In general, explicative data (or features) give information on observed facts such as client profile, behaviour, orientation, product features, etc. They are related to events that occurred in the past.
The target variable, in contrast, is not directly observable; client churn, product failure or fraud can be detected only once reported.
Building a machine learning model helps not only to predict that unobservable business value but also to explain the factors behind it.

Take this example: "product reclamation". Once a product is bought on the internet, the buyer profile and product features are known and stored in the dataset. Reclamations are collected separately from user assistance, social networks, user feedback, emails, etc.
If the company would like to know whether a reclamation is due to a product deficiency or to user misuse, data scientists would build a model that takes the explicative variables (user profile, product features, usage conditions, …) to predict the reclamation (target variable).

These explicative variables are usually provided by the business expert (expert in accounting, risk analysis, finance, etc.) to help the data scientist in modeling.

Supervised & Unsupervised Learning

There are two common learning types: supervised learning and unsupervised learning.
Supervised learning aims at finding a model that approximates (predicts) a target variable based on a set of input features, whereas unsupervised learning looks for structure (such as clusters) in the data without any target variable.
Supervised learning can be classification, when the output is categorical (or qualitative), or regression, when the output is quantitative.

Data types

Quantitative data represents a quantity such as seniority years, duration, sales, cost, profit, etc.
Qualitative data describes a quality such as sex, job title, socio-professional category, etc.
In the rest of this article, I explain the remaining steps by showing some examples for each of them.

I am going to use two projects: Credit Risk Analysis and ATM Cash Prediction.

Example 1 : Credit Risk Analysis. Performing credit risk analysis helps the lender determine the borrower's ability to meet debt obligations, in order to cushion itself from loss of cash flows and reduce the severity of losses.
Example 2 : ATM Cash Prediction.
Historical data from ATMs can help create intelligent cash management systems based on cash demand forecasting, which eventually helps reduce financial costs.

Step B : Data Cleaning & Normalization

Since data are collected from different systems, we could have some heterogeneity resulting in different formats.
Indeed, due to errors related to data collection or loading, we could have alterations such as stray punctuation and duplicated data.
Therefore, to make data ready for the modeling step, one should perform transformations to renormalize them.
After reading the data, a step of data type exploration and format verification should be performed.

Data Reading & display step
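As a minimal sketch (the file name credit.csv and the use of read_csv are assumptions for illustration), reading and displaying the data with pandas could look like this:

```python
import pandas as pd

# Read the raw credit data (the file name is a hypothetical placeholder)
credit = pd.read_csv("credit.csv")

# First look at the data: shape and first rows
print(credit.shape)
print(credit.head())
```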

Duplicates Removal step

Duplicate removal with pandas
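A minimal version of that command, assuming the dataframe is named credit:

```python
# Drop exact duplicate rows, keeping the first occurrence
print("rows before:", len(credit))
credit = credit.drop_duplicates(keep="first").reset_index(drop=True)
print("rows after :", len(credit))
```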

We define the set of quantitative variables in a list called num, the list of qualitative variables in a list called dummies, and then the target variable.
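A sketch of those three objects; the column names below are illustrative ones borrowed from a typical credit risk dataset, not necessarily the exact columns of the original notebook:

```python
num = ["Age", "Credit amount", "Duration"]        # quantitative variables
dummies = ["Sex", "Job", "Housing", "Purpose"]    # qualitative variables
target = "Risk"                                   # explained (target) variable
```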

Data Normalization step

Now let's check the format of the different dummies by listing the existing unique values in each column.
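For example, with the lists defined above:

```python
# Inspect the distinct values of each qualitative column and of the target
for col in dummies + [target]:
    print(col, "->", credit[col].unique())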

On that sample, the Risk column contains both 'good' and 'G' to describe the same quality, and likewise 'bad' and 'B', so a transformation has to be done to map them onto a single normalized value.
Similarly, we can see undesirable whitespace in the "Housing" column; "free \s\s\s\s " and "free" are considered two different values whereas they describe the same thing.

Pandas commands to normalize the content of "Housing" & "Risk" in the credit dataframe
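A minimal sketch of such a normalization; the 'G'/'B' mapping is an assumption based on the duplicated labels described above:

```python
# Strip stray whitespace, then map duplicated labels onto a single value
credit["Housing"] = credit["Housing"].str.strip()
credit["Risk"] = credit["Risk"].str.strip().replace({"G": "good", "B": "bad"})
```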

Other Normalization steps :

Of course, normalization steps may differ from one project to another; here I mention some other normalizations that are not needed in this project but should be taken into consideration.

  • Date formats : in some datasets, we could have different date forms such as mm/dd/yyyy, mm-dd-yyyy or yyyy-mm-dd, and all those forms should be transformed into a single normalized format.
  • Float formats : depending on how the original data is stored, we could find a float represented as xxx,xxx or xxx.xxx, and then the standard Python float type should be adopted (see the sketch below).
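A small sketch of both normalizations (the values and variable names are illustrative):

```python
import pandas as pd

# Heterogeneous date strings: parse each known form and keep the first successful parse
raw_dates = pd.Series(["12/31/2020", "12-31-2020", "2020-12-31"])
dates = (
    pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")
    .fillna(pd.to_datetime(raw_dates, format="%m-%d-%Y", errors="coerce"))
    .fillna(pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce"))
)

# Floats stored with a comma separator become standard Python floats
amounts = pd.Series(["1234,56", "789.01"]).str.replace(",", ".").astype(float)
```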

Type casting step

A natural step that comes just after data normalization is type checking; before normalization, data are usually considered as undefined objects or strings; once normalized, we should check whether the new data type corresponds to the desired type in order to perform the next computations.

Here, an example from the ATM project shows that the "date" column is defined as object rather than a date type, and the same goes for "month"; we can then develop a transformation to cast those types.

Type checking on the ATM dataframe
Type casting on the ATM dataframe
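A sketch of both operations, assuming the ATM data has been read into a dataframe named atm with "date" and "month" columns:

```python
# Check the current dtypes: 'date' and 'month' show up as plain objects
print(atm.dtypes)

# Cast them to proper types so that date arithmetic and grouping work as expected
atm["date"] = pd.to_datetime(atm["date"], errors="coerce")
atm["month"] = atm["month"].astype(int)

print(atm.dtypes)
```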

Step C : Feature Engineering

It's the most important step and generally represents around 60% to 80% of the whole data science project, since it largely determines the model accuracy.
Note that the order of the presented steps is not mandatory; these steps can be done at any point and even repeated many times.

Step 0 : Irrelevant features removal

Before starting the development of features, and in order to reduce memory usage, we can remove irrelevant features that are not related to the target variable, such as low-variance columns and columns containing client ids or codes, if they won't be used for joins later.
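As a small illustration (the client_id column name and the variance threshold are assumptions):

```python
# Drop identifier-like columns and near-constant (low-variance) numeric columns
low_variance = [c for c in credit.select_dtypes("number").columns
                if credit[c].var() < 0.01]
credit = credit.drop(columns=["client_id"] + low_variance, errors="ignore")
```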

Step 1 : Computing new variables & indicators

This step consists of computing informative indicators from existing data.
It's one of the most important steps of the whole process, since a prediction model is like a human: it makes better decisions when the provided information is relevant.

Let's see some examples on the ATM dataset:
We would like to introduce information such as the year of measure and the average demand per ATM in a given quarter.
We have a "date" column representing the full date of the withdrawal operation, and since dates are not edible raw, let's see how to cook them ;)

Computing past_years, month, quarter
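A sketch of those computations with the pandas dt accessor (the past_years definition is one possible choice):

```python
# Derive calendar features from the withdrawal date
atm["year"] = atm["date"].dt.year
atm["month"] = atm["date"].dt.month
atm["quarter"] = atm["date"].dt.quarter

# Number of years elapsed since the observation (a simple recency indicator)
atm["past_years"] = pd.Timestamp.today().year - atm["year"]
```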

Step 2 : Dealing with Missing Data

In the data science literature, there are two ways to deal with missing values in a dataset: missing data removal and missing data imputation.
The "df.info()" command of pandas displays the number of non-null values per column.
It's recommended to compute the rate of NaN values horizontally and vertically (i.e. the rate by column and the rate by row).
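For example:

```python
# Rate of missing values per column (vertical) and per row (horizontal)
nan_rate_by_column = credit.isna().mean()
nan_rate_by_row = credit.isna().mean(axis=1)
print(nan_rate_by_column.sort_values(ascending=False))
```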

Missing data Removal :

  • Row removal : if, for a given row, this ratio is high, it's recommended to delete the entire row as it won't provide valuable information.
  • Column removal : as a rule of thumb, 25 to 30% is the maximum rate of missing values allowed, beyond which we might want to drop the variable from the analysis.
    But if that variable is qualified by business experts as important for the analysis, we can tolerate higher rates.

Missing data Imputation:

Imputing missing data consists in filling null values with data that must be neutral, in order to avoid leading to wrong decision making.
There are several common techniques to deal with missing data, for both categorical and continuous variables.

Missing categorical data can be imputed by replacing NaNs with the most frequent mode, but this is not applicable if NaN itself is the most frequent value.
A more general way to replace those NaNs is to introduce a new mode representing "unknown" values.

Filling missing categorical data with new value “unk”
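A minimal version of that imputation, reusing the dummies list defined earlier:

```python
# Replace missing categories with an explicit "unknown" mode
for col in dummies:
    credit[col] = credit[col].fillna("unk")
```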

Missing quantitative data is commonly imputed with zero-imputation, mean-imputation (fill with the mean of the column) or mode-imputation (fill NaNs with the mode of the column).
The choice of imputation strategy depends mainly on the data and the task, and the main selection criterion is to preserve the neutrality of the data after the filling procedure.

Example of filling numerical data with the corresponding mean value
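A minimal version of mean-imputation, reusing the num list defined earlier:

```python
# Replace missing numeric values with the mean of each column
for col in num:
    credit[col] = credit[col].fillna(credit[col].mean())
```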

These methods artificially generate values whose dependence on the target variable is essentially random; a smarter way to deal with missing data is to exploit the inter-dependencies between features and build a predictive ML model that predicts missing values from the other, non-missing columns.
In practice, this can be done by sorting columns by increasing missing rate and then using the low-rate columns to predict the values in the high-rate columns.

Step 3 : Regrouping under-represented dummies

In some projects, we could find modes with a low occurrence in a given variable; they won't provide relevant information, so we group them into a single mode.

How to group a list of modes
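A sketch of that regrouping; the "Purpose" column and the 1% threshold are illustrative choices:

```python
# Group categories that appear in less than 1% of the rows into a single mode
counts = credit["Purpose"].value_counts(normalize=True)
rare_modes = list(counts[counts < 0.01].index)
credit["Purpose"] = credit["Purpose"].replace(rare_modes, "other")
```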

Step 4 : Scaling & Dummies representation

In a data project, we could find quantitative data with very different scales (10, 1K, 10K, etc.), which does not fit well into a predictive model.
In order to make accurate predictions, data should be rescaled into a normalized range of values.

The two most commonly used scalers are StandardScaler and MinMaxScaler; both are already implemented in scikit-learn.
And of course, the dummy representation of categorical data is an intuitive step.
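A sketch of both operations with scikit-learn and pandas:

```python
from sklearn.preprocessing import StandardScaler

# Rescale the quantitative columns to zero mean and unit variance
scaler = StandardScaler()
credit[num] = scaler.fit_transform(credit[num])

# One-hot (dummy) encoding of the qualitative columns
credit = pd.get_dummies(credit, columns=dummies)
```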

Step 5 : Dealing with categorical variables with high cardinality

In some projects, we could find qualitative variables with high cardinality, i.e. more than 70 distinct values, sometimes over 1,000.

One-hot encoding is then infeasible, since the size of the input table that feeds the prediction model would be huge, which can even degrade model performance.

There are some common methods to deal with that high cardinality, mainly consisting in regrouping sets of values in a logical manner that provides relevant information on the target.

  • A common example is zip codes, and regions in general such as cities or countries; in some datasets, we find over 500 different zip code values.
    The common method consists in regrouping codes starting with the same digits, since they represent neighboring regions.
  • Applying a clustering algorithm is also commonly used to form groups of similar items within the same cluster.

In order to cluster data, we first have to set up a segmentation criterion.

  • For example, a set of tags describing an entity (a movie, for example) can be regrouped using co-occurrence, and clustering would then be done on the co-occurrence matrix (or correlation matrix).
  • Another example: in a cost prediction project with a set of more than 1,000 providers, we could select as segmentation criterion statistics on delivery time or statistics on cost; a combination of multiple segmentation criteria is also possible.

So let's try that on our dataset.
I have data coming from 300 different ATMs with unknown locations; how do we include information on the ATM in the prediction model?

I will choose the minimum yearly cash demand as the segmentation criterion.
The pivot transformation :
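A sketch of that pivot, assuming the ATM dataframe has "atm_id", "year" and "cash_demand" columns (illustrative names):

```python
# Minimum yearly cash demand per ATM: one row per ATM, one column per year
pivot = atm.pivot_table(index="atm_id", columns="year",
                        values="cash_demand", aggfunc="min").fillna(0)
```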

Clustering the pivot table values :
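A sketch with KMeans (the number of clusters here is only a first guess, refined below):

```python
from sklearn.cluster import KMeans

# Cluster ATMs on their yearly minimum demand profile
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
atm_cluster = pd.Series(kmeans.fit_predict(pivot.values),
                        index=pivot.index, name="atm_cluster")

# Map each ATM back to its cluster label in the main dataframe
atm = atm.merge(atm_cluster, left_on="atm_id", right_index=True)
```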

How do we select the optimal number of clusters?
Answer: there are several mathematical criteria, but the number of clusters is not a crucial parameter; what matters most is the interpretability of the clusters and their relatedness to the target variable.
We try two common clustering algorithms (KMeans & AgglomerativeClustering) with different numbers of clusters, and pick the number corresponding to a gap in the silhouette scores (in the presented figure there are gaps at cluster = 3 and cluster = 13).

Silhouette scores
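A sketch of how such silhouette curves can be computed:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Scan a range of cluster counts and record the silhouette score of each algorithm
for k in range(2, 16):
    for algo in (KMeans(n_clusters=k, n_init=10, random_state=0),
                 AgglomerativeClustering(n_clusters=k)):
        labels = algo.fit_predict(pivot.values)
        print(type(algo).__name__, k,
              round(silhouette_score(pivot.values, labels), 3))
```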

According to the Elbow method, the optimal number of clusters corresponds to the point where the inertia values start to converge (opt = 14 in the example).

These two criteria do not always show up clearly, since they depend on the selected segmentation criterion, which makes regrouping data a heavy task.
Once a decision is made regarding the optimal number of clusters and the clustering algorithm, we apply that choice to associate each original value with its corresponding cluster.

Step 6 : Joining data

Finally, if a data project is composed of multiple table sources, we join the prepared data to form the feature array that will be used in the predictive model.

Step D : Feature Selection

In some data projects we could have around a hundred columns.
Dealing with high data dimensionality may lead to model performance degradation due to the curse of dimensionality, besides long training times.
In addition, some information redundancy and inter-correlation can occur, alongside irrelevant variables.
In the literature, there are three main feature selection methods: filter methods, wrapper methods and embedded methods.

Filter method process for feature selection

Filter methods use statistical tools to evaluate the relevance of features with respect to the target variable before training the algorithm.
They are mainly based on statistical tests for pairwise independence between each explicative variable and the target variable, such as the chi-square test for categorical variables, the Fisher score, etc.
In practice, these score-based tests can fail to find an adequate variable selection on their own.
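As an illustration, a filter step with scikit-learn's SelectKBest and the chi-square test could look like this (the column subset and k are assumptions; chi2 requires non-negative inputs such as one-hot encoded columns):

```python
from sklearn.feature_selection import SelectKBest, chi2

# Score one-hot encoded columns against the target and keep the best ones
X_cat = credit.filter(regex="^(Sex|Housing|Purpose)_")   # illustrative subset
y = credit["Risk"]
selector = SelectKBest(score_func=chi2, k=min(10, X_cat.shape[1]))
selector.fit(X_cat, y)
print(X_cat.columns[selector.get_support()])
```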

Wrapper method process for feature selection

Wrapper methods are based on greedy search algorithms: they iteratively select a subset of features, combine the selector with a machine learning algorithm, and then choose the optimal selection based on the best subset performance. RFE (Recursive Feature Elimination) is the most commonly used selector.
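A sketch of RFE with a logistic regression as base estimator (the number of features to keep is an arbitrary choice here):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively eliminate features until 10 remain
X = credit.drop(columns=["Risk"])
y = credit["Risk"]
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)
print(X.columns[rfe.support_])
```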

Embedded method process for feature selection

Embedded methods use the feature selection mechanism of the algorithm itself, such as the weights of a linear regression or the feature importance of tree-based algorithms like decision trees, XGBoost, random forests, etc.
They evaluate the contribution of a given feature to the algorithm's decision.
You can refer to these articles for more details about the advantages and disadvantages of each method:
Filter Methods, Wrapper Methods and Embedded Methods.
These methods can be combined to perform an optimal feature selection with the best algorithm accuracy.

Correlation Study :
Let's see in practice how to detect correlations between features on the credit risk dataset.
We first compute a PCA on the scaled data and plot the cumulative sum of the ordered eigenvalues.
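A sketch of that computation:

```python
import numpy as np
from sklearn.decomposition import PCA

# PCA on the scaled feature matrix, then cumulative explained variance ratio
pca = PCA()
pca.fit(credit.drop(columns=["Risk"]))
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.99) + 1, "components explain 99% of the variance")
```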

This curve converges at the 19th value with 99% of the total cumulative sum, meaning that the information can be reduced to 19 decorrelated components.

Considering the PCA-projected features as input of the training algorithm won't help in identifying the key important factors and makes the model uninterpretable.
Let's consider another correlation study concept.
Visualizing the correlation matrix helps data scientists detect pairwise correlations.

In the following Python snippet, we detect the most correlated feature pairs by taking 0.5 as the correlation threshold (the recommended threshold is 0.8).
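A possible version of that snippet:

```python
# Flag every pair of features whose absolute correlation exceeds the threshold
threshold = 0.5
corr = credit.drop(columns=["Risk"]).corr().abs()
pairs = [(a, b, round(corr.loc[a, b], 2))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > threshold]
print(pairs)
```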

Step E : Prepare for training

Now that our data is ready to train and score, there are two other pre-training steps that need to be done in order to evaluate the quality of the trained model and its ability to generalize to new unseen data.

Data Splitting :

Therefore, we have to define three datasets:
- Train data : a sample of the available data used to fit the model
- Validation data : another sample of labeled data used to evaluate the quality of the model and its ability to generalize to new data
- Test data : new data used once the model is deployed in a production environment; it is usually completely decorrelated from the train data.
Generally, if a large dataset is available, we split the data into three groups (train, validation, test), but if only a small amount of data is available we only keep train and validation.
When scoring the three data samples, the highest score is usually obtained on the train data, as the model has learned its existing patterns, followed by the validation score.
If a large gap between the train score and the validation score appears, we have overfitting and we have to re-model to obtain better generalization properties.
As they come from the same dataset, validation and training data could have small correlations (for example, they are taken in the same month, in the same region, etc.), whereas test data is generally decorrelated (as we test on a different month, region, etc.), and then the test score could be lower than the validation score.

Data Balancing :

This is particularly relevant for unbalanced classification projects; taking the example of "product reclamation": reclamations would represent only 3% of the whole data.
Keeping train data with such unbalanced rates would make the model favor the majority class and leave it unable to detect the infrequent class.
Balancing consists in equalizing the presence of the classes in the data; it can be done either by downsampling (reducing the size of the majority class) or upsampling (increasing the size of the infrequent class).
Yet another method consists in using a parameter of the training algorithm (e.g. class weights in tree-based algorithms) to balance during fitting.

In the credit risk data, we have 30% of "bad" scores and 70% of "good".
In practice, the scikit-learn function train_test_split is used to split the data into train and validation sets with a validation size of 30%.
The following Python code balances the data and then applies the split:
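A sketch of a simple downsampling followed by a stratified split (random states are arbitrary):

```python
from sklearn.model_selection import train_test_split

# Downsample the majority class ("good") to the size of the minority class ("bad")
bad = credit[credit["Risk"] == "bad"]
good = credit[credit["Risk"] == "good"].sample(n=len(bad), random_state=0)
balanced = pd.concat([bad, good]).sample(frac=1, random_state=0)  # shuffle

X = balanced.drop(columns=["Risk"])
y = balanced["Risk"]

# 70% train / 30% validation, stratified to keep the class ratio in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```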

Step F : Prediction Model

So now, as our data is prepared and balanced, it is ready to be eaten by the model.
We train on the train dataset and then evaluate the model on the validation data.
There are several classification algorithms; I am not going to go into the mathematical details of each algorithm, I will just list them:

  • SVM : Support Vector Machine
  • RF : Random Forest
  • DT : Decision Tree
  • MLP : Multi-Layer Perceptron
  • LR : Logistic Regression
  • NB : Naive Bayes
  • LDA : Linear Discriminant Analysis
  • LGBM : Light Gradient Boosting Machine
  • XGBoost : Extreme Gradient Boosting

There are several metrics used to evaluate the score on the validation data, namely: accuracy, F1-score, recall and the confusion matrix.

I use the classification report to visualize the precision per class.
Accuracy is not the best model selection criterion since it does not give information on the separability capacity.
Alternatively, the ROC AUC (Area Under the Receiver Operating Characteristic curve) score is used instead.
Still, the model has to be validated by business experts and/or researchers once it provides a logical interpretability in line with the business requirements.
Therefore, displaying the feature importance helps in making a decision regarding the model's adequacy.
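Continuing the earlier split, a sketch of this evaluation with a random forest (the hyper-parameters are arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Train a random forest and evaluate it on the validation set
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print(classification_report(y_val, rf.predict(X_val)))

# ROC AUC needs the probability of the positive class ("good" sorts last, so index 1)
proba_good = rf.predict_proba(X_val)[:, 1]
print("ROC AUC:", roc_auc_score(y_val, proba_good))

# Feature importance helps business experts judge whether the model is plausible
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```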

Feature importance given by random forest algorithm

We could even visualize feature importances by label :

Feature importance given by random forest algorithm for “bad” class

According to that graph, as the credit amount gets larger, the risk becomes higher, which is logical.
On the other side, juniors tend to have riskier credit, which is also a reasonable result, as juniors have fewer resources than seniors.

Concluding Remarks :

This story covered the main steps to build a data science project, especially for a classification task.
Note that there is no standard method to build a data science project, so some steps may be absent from a project when they are not needed.
Besides, the feature engineering steps do not obey a particular order; all of them can be done as often as needed in a project.

Enhancing the model is an iterative cycle; it may be done by reviewing the feature engineering choices, such as adding more informative features (more variables that give better information on the explained variable), reviewing the data imputation, the data segmentation, etc.
Another process that helps boost performance is the introduction of historical information, especially when a value depends on what happened in the past.
Another trick that can help get better accuracy is to segment by population and build a predictive model per population ;)

Python code is provided here

