Predicting the Success of a Bank Marketing Campaign

Sangramsing Kayte
Published in DataDrivenInvestor
Jun 24, 2021 · 4 min read


General Architecture Diagram

Step-by-Step Approach

  • The input dataset is labelled: it already contains the value to be predicted, i.e. the target in our scenario.
  • Because the data is labelled, this problem can be solved with a supervised ML approach: the model learns the feature-to-target mapping from the dataset, so it falls into the supervised learning category.
  • Before any model building, the first step is pre-processing and EDA (Exploratory Data Analysis), so that we understand our dataset better.
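The first pre-processing step below, separating numerical from categorical features, can be sketched as follows. The column names and sample rows here are illustrative examples drawn from the UCI bank-marketing dataset, not the full schema:

```python
# Two sample rows standing in for the full dataset.
sample_rows = [
    {"age": 41, "job": "technician", "balance": 1270, "housing": "yes"},
    {"age": 33, "job": "admin.", "balance": 231, "housing": "no"},
]

def split_feature_types(rows):
    """Classify each column as numerical or categorical by value type."""
    numerical, categorical = [], []
    for column in rows[0]:
        if all(isinstance(row[column], (int, float)) for row in rows):
            numerical.append(column)
        else:
            categorical.append(column)
    return numerical, categorical

numerical, categorical = split_feature_types(sample_rows)
print(numerical)    # ['age', 'balance']
print(categorical)  # ['job', 'housing']
```

In a pandas workflow the same split is usually done with `select_dtypes`; the pure-Python version above just makes the idea explicit.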

Pre-processing steps that we have taken

  1. Identify the numerical and categorical features.
  2. Check for missing values.
  3. Plot the value counts of the categorical features to see how each feature impacts the ‘target’ variable.
  4. The plots show this is a highly imbalanced dataset: roughly 0.88 of the labels are No and only about 0.11 are Yes.
  5. If we train the model on this dataset as-is, it will mostly learn to predict the majority class.
  6. To avoid this problem, we apply SMOTE, which stands for Synthetic Minority Over-sampling Technique.
  7. Class imbalance can be handled in two ways, under-sampling and over-sampling. Under-sampling is more suitable for datasets with millions of rows; here it would shrink the dataset and we might lose important data. So we over-sample instead, which means generating more minority-class data, and SMOTE is the technique that generates those synthetic samples.
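The core idea of SMOTE can be sketched in a few lines: new minority-class points are created by interpolating between existing minority samples. This is a simplified stand-in (real SMOTE interpolates toward k-nearest neighbours, and in practice you would use `imblearn.over_sampling.SMOTE`); the sample points are made up for illustration:

```python
import random

def smote_like_oversample(minority, n_new, seed=0):
    """Generate synthetic minority samples by interpolating between
    two randomly chosen minority samples -- the core idea of SMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# Three toy minority-class rows with two features each.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=4)
print(len(new_points))  # 4 synthetic rows
```

Each synthetic row lies on the line segment between two real minority rows, so the new data stays inside the region the minority class already occupies.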

The next important point: after converting the categorical columns into individual (one-hot) features, we have 62 columns, i.e. 62 features. We are definitely not going to train the model on all of them, as the model would suffer from overfitting. Instead, with a feature-selection technique based on the Extra Trees Classifier, we select only the top 10 features that most influence the target variable. Extra Trees Classifier for feature selection: the forest assigns each feature an importance score, the features are ordered in descending order of that score, and the user selects the top k from the list.
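The "order by importance, keep the top k" step looks like this. The importance scores below are made-up placeholders; in the real pipeline they would come from a fitted `sklearn.ensemble.ExtraTreesClassifier` via its `feature_importances_` attribute:

```python
# Placeholder importance scores (illustrative, not the article's values).
importances = {
    "last_contact_duration": 0.21,
    "euri_3_months": 0.14,
    "age": 0.09,
    "campaign": 0.05,
}

def top_k_features(scores, k):
    """Return the k feature names with the highest importance score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

print(top_k_features(importances, k=2))
# ['last_contact_duration', 'euri_3_months']
```

In the article's pipeline the same call with k=10 over the 62 one-hot features yields the 10 columns used for training.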

The graph shows that last_contact_duration is the most important feature, followed by Euri_3_months.

Selecting the best supervised ML model is another challenging task in the project. To guide the choice, we use a pair plot. The visualization shows that the feature distributions for the two predicted classes overlap heavily, so logistic regression is a poor fit. That leaves the tree-based options: decision trees, random forests, or boosting.

A single decision tree is suitable for a very small dataset, where it can be grown down to the last node. But we have a reasonably large dataset, so I decided to go with the random forest, as it is not as computationally expensive as XGBoost.

After applying the random forest, it worked well: the model reached 93% accuracy, with precision of 0.94 and recall of 0.93 for both the Yes and No labels.
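Precision and recall are computed per label from confusion-matrix counts. The counts below are illustrative only (the article does not give its confusion matrix), chosen so the rounded results match the reported 0.94 / 0.93:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts:
    precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts for one label (not the article's actual numbers).
p, r = precision_recall(tp=930, fp=60, fn=70)
print(round(p, 2), round(r, 2))  # 0.94 0.93
```

Precision answers "of the campaigns we predicted as successes, how many were?", while recall answers "of the actual successes, how many did we catch?" — after SMOTE balancing, both are worth reporting per class, as the article does.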

Lastly, the model is deployed using Flask: the app exposes a prediction API, and that API is integrated with a simple HTML form as the front end.
With that in place, we can easily predict the success of a bank marketing campaign.
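The handler logic behind such a Flask route can be sketched as below. The function and field names are hypothetical, and `DummyModel` is a stand-in for the trained random forest; in the deployed app, a Flask route would parse the submitted form and call a handler like this:

```python
def predict_campaign_success(form_data, model):
    """Turn submitted form fields into a feature vector and predict."""
    features = [float(form_data["last_contact_duration"]),
                float(form_data["euri_3_months"])]
    label = model.predict([features])[0]
    return {"success": "yes" if label == 1 else "no"}

class DummyModel:
    """Stand-in for the trained random forest (same predict() shape)."""
    def predict(self, rows):
        # Toy rule: long calls are predicted as successes.
        return [1 if row[0] > 300 else 0 for row in rows]

result = predict_campaign_success(
    {"last_contact_duration": "450", "euri_3_months": "1.3"},
    DummyModel(),
)
print(result)  # {'success': 'yes'}
```

Keeping the prediction logic in a plain function like this makes it easy to unit-test separately from the Flask routing and HTML form.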



I am a Machine Learning Scientist with over 9 years of experience across industry and research & development.