Feature Selection Techniques

Nishant Shah · Published in DataDrivenInvestor · Jul 14, 2020



What is feature selection?

We have all seen datasets. Sometimes they are small, but often they are tremendously large, at least large enough to cause a processing bottleneck.

The training time and performance of a machine learning algorithm depend heavily on the features in the dataset. Ideally, we should only retain the features that actually help our machine learning model learn something. Having too many features poses a problem well known as the curse of dimensionality.

Unnecessary and redundant features not only slow down the training of an algorithm, but also hurt its performance. The process of selecting the most suitable features for training the machine learning model is called “feature selection”.

Before performing feature selection, we need to do data pre-processing. You can check this out.

Benefits of performing feature selection:

There are several advantages of performing feature selection before training machine learning models, some of which are listed below:

  • Models with fewer features have higher explainability
  • It is easier to implement machine learning models with reduced features
  • Fewer features lead to enhanced generalization which in turn reduces overfitting
  • Feature selection removes data redundancy
  • Training time of models with fewer features is significantly lower
  • Models with fewer features are less prone to errors

Feature Selection Techniques:

Several methods have been developed to select the optimal features for a machine learning algorithm.

Note: In this article we will discuss the most widely used methods. All the techniques will be implemented independently of each other, not in succession.

  1. Filter Method.
  2. Wrapper Method.
  3. Embedded Method (Shrinkage).

Filter Methods:

Filter methods can be broadly divided into two categories: univariate filter methods and multivariate filter methods.

Univariate filter methods are methods where individual features are ranked according to a specific criterion; the top N features are then selected.

Statistical tests can be used to select the features that have the strongest relationship with the output variable. Mutual information, the ANOVA F-test, and chi-square are some of the most popular univariate feature selection methods.

The scikit-learn library provides:

SelectKBest: keeps the top k scoring features.

SelectPercentile: keeps the top features within a user-specified percentile.

It must be noted that chi² can be used only for data that is non-negative in nature.

The example below uses the chi² statistical test for non-negative features to select the 10 best features from the Mobile Price Range Prediction dataset.

You can download the dataset separately.
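A minimal sketch of this step, assuming the data sits in a local train.csv with a price_range target column (the file and column names are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Load the Mobile Price Range Prediction data
# (file name and target column name are assumptions)
df = pd.read_csv("train.csv")
X = df.drop("price_range", axis=1)
y = df["price_range"]

# Score every feature with the chi-squared test and keep the 10 best
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X, y)

# Names of the selected features
print(X.columns[selector.get_support()])
```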

Now we will see how to remove features with very low variance, as well as correlated features, from our dataset with the help of Python.

If a feature has a very low variance (i.e. very close to 0), it is close to being constant and does not add any value to our model at all. Getting rid of such features lowers the complexity. Please note that variance also depends on the scaling of the data. Scikit-learn provides VarianceThreshold, which does precisely this.

All columns with variance less than 0.1 will be removed.
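A minimal sketch, assuming X is the same mobile-price feature DataFrame as in the previous sketch:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Feature matrix from the previous sketch (file and column names are assumptions)
X = pd.read_csv("train.csv").drop("price_range", axis=1)

# Drop every column whose variance is below 0.1
var_thresh = VarianceThreshold(threshold=0.1)
X_high_variance = var_thresh.fit_transform(X)

# Columns that survived the threshold
print(X.columns[var_thresh.get_support()])
```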

Correlation between the input features and the output variable is very important, and such features should be retained. However, if two or more features are mutually correlated, they convey redundant information to the model.

We can remove features that are highly correlated with each other. Please note that we will be using the Pearson correlation to calculate the correlation between the numerical features.

A heatmap makes it easy to identify which features are most related to the target variable, so we will plot a heatmap of the correlated features using the seaborn library:
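A minimal sketch, assuming the California housing data with an engineered MedInc_Sqrt column (the dataset loader and the engineered column are assumptions based on the feature names mentioned below):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California housing data as a DataFrame and add an engineered feature
# (MedInc_Sqrt mirrors the column referenced in the text)
df = fetch_california_housing(as_frame=True).frame
df["MedInc_Sqrt"] = np.sqrt(df["MedInc"])

# Pearson correlation matrix of all numerical features (including the target)
corr = df.corr(method="pearson")

# Heatmap of the correlations
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```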

We see that the feature MedInc_Sqrt has a very high correlation with MedInc. We can thus remove/drop one of them.
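Dropping the redundant column is then a one-liner (continuing with the DataFrame from the sketch above):

```python
# Keep MedInc and drop its engineered, highly correlated duplicate
df = df.drop("MedInc_Sqrt", axis=1)
```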

Now, you might ask: why not remove irrelevant features by intuition, or just by looking at the heatmap?

In general it’s advisable not to be influenced by one’s bias or intuition.

In a real-life situation we would have to deal with more than 3 features (typically from a few hundred to many thousands). It would therefore be infeasible to go through each of them and decide whether to keep it or not. Moreover, there might be relationships among variables that are not easily spotted by the human eye, even with careful analysis.

However, in some scenarios you may want to use a specific machine learning algorithm to train your model. In such cases, the features selected through filter methods may not be the optimal set for that specific algorithm. There is another category of feature selection methods that selects the optimal features for the specified algorithm: these are called wrapper methods.

Wrapper Methods:

Wrapper methods evaluate combinations of variables to determine their predictive power. They are based on greedy search algorithms: the wrapper builds a model for each candidate combination of features, evaluates it, and keeps the best combination.

Of the three categories, wrapper methods are the most computationally intensive. They are not recommended for a large number of features, and if used improperly you might even end up overfitting the model.

Common wrapper methods include: Stepwise/Subset Selection, Forward Stepwise, and Backward Stepwise (RFE).

Here I have mentioned the basic steps to be followed (the final compare step is sketched after the list):

  • Train a baseline model.
  • Identify the most important features using a feature selection technique
  • Create a new ‘limited features’ dataset containing only those features
  • Train a second model on this new dataset
  • Compare the accuracy of the ‘full featured’ (baseline) model to the accuracy of the ‘limited featured’ (new) model, as sketched below
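A minimal sketch of the compare step, using the California housing data as an example; the estimator and the particular selected feature subset are assumptions:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Example regression data
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# Baseline model trained on the full feature set
model = RandomForestRegressor(n_estimators=100, random_state=42)
full_score = cross_val_score(model, X, y, cv=3).mean()

# 'Limited features' dataset: keep only the features a selector marked as important
# (this particular subset is hypothetical; any of the techniques below could produce it)
selected = ["MedInc", "AveOccup", "Latitude"]
limited_score = cross_val_score(model, X[selected], y, cv=3).mean()

print(f"Full-featured R^2: {full_score:.3f}, limited-feature R^2: {limited_score:.3f}")
```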

Forward Selection:

  1. Identify the best variable (e.g. based on model accuracy)
  2. Add the next variable into the model
  3. And so on, until some predefined criterion is satisfied (see the sketch below)
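A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector, continuing with the X and y loaded in the previous sketch; the estimator and the number of features to select are assumptions:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Greedily add one feature at a time until 3 features are selected,
# scoring each candidate addition with 5-fold cross-validation
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X, y)

print(X.columns[sfs.get_support()])
```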

Stepwise/Subset Selection:

Similar to the forward selection process, but a variable can also be dropped if it is no longer deemed useful after a certain number of steps.

Now let’s implement the various feature selection techniques.

1. Backward Stepwise (Recursive Feature Elimination (RFE))

Recursive = Something that happens repeatedly

As the name suggests, Recursive Feature Elimination works by recursively (repeatedly) removing features and building a model on the features that remain.

The example below uses RFE with a linear regression estimator to select the top 3 features. The choice of algorithm does not matter much; instead of linear regression we could use any other algorithm.

We use the feature_selection module from the scikit-learn library to apply Recursive Feature Elimination (RFE):
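A minimal sketch, again using the California housing data as an assumed example dataset:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Example regression data
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# Recursively eliminate features until only the top 3 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)

# Selected features and the elimination ranking (1 means selected)
print(X.columns[rfe.support_])
print(rfe.ranking_)
```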

Scikit-learn also offers SelectFromModel, which helps you choose features directly from a given model. You can also specify a threshold for the coefficients or feature importances, as well as the maximum number of features you want to select.
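A minimal sketch of SelectFromModel, reusing the X and y from the RFE sketch above; the random forest estimator, the threshold, and max_features are assumptions:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the mean importance, capped at 5 features
sfm = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=42),
    threshold="mean",
    max_features=5,
)
sfm.fit(X, y)

print(X.columns[sfm.get_support()])
```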

Embedded Method (Shrinkage):

The embedded method is a built-in variable selection method. We do not explicitly select or reject predictors in this method; instead, the model controls the values of its parameters so that less important predictors get very low weights (close to zero). This is also known as regularization.

Feature selection can be done with models that have L1 (Lasso) penalization. With an L1 penalty for regularization, most coefficients become 0 (or close to 0), and we select the features with non-zero coefficients.
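A minimal sketch of L1-based selection, again on the California housing data; the alpha value is an assumption and should be tuned:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Example regression data
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# L1 penalties are sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Fit a Lasso model and keep only the features with non-zero coefficients
sfm = SelectFromModel(Lasso(alpha=0.01))
sfm.fit(X_scaled, y)

print(X.columns[sfm.get_support()])
```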

L2 (Ridge) penalization adds a penalty equal to the square of the magnitude of the coefficients. All coefficients are shrunk by the same factor, so none of the predictors are eliminated.

In the end, I would like to say that feature selection is a decisive part of a machine learning pipeline: being too conservative means retaining unnecessary noise, while being too aggressive means throwing away useful information.

If you are curious to learn about missing values treatment, then check this out.

If you found this article useful give it a clap and share it with others.

Happy Learning

Thank You
