Detecting Within-Range Outliers in Linear Regression

Mahindra Venkat Lukka
Published in DataDrivenInvestor · Jul 6, 2023


The Constant Slope Approach


Introduction:

Outliers, data points that significantly differ from other observations, can arise from various factors such as data errors, sampling errors, or natural variation. Typically, outliers are characterized as out-of-range outliers, representing extreme values compared to the rest of the data. However, there are also outliers that fall within the observed range, known as within-range outliers. This article focuses on within-range outliers and presents a method for their detection and removal in the context of linear regression.

Out of Range Outliers Example:

Here, we can see a point with X1 = 30, whereas the remaining data points range from 0 to 12. This point is therefore an out-of-range outlier.

Within-Range Outliers Example:

Here, we can see a couple of outliers in the X1 series (values 22 and 20) and a couple of outliers in the X2 series (values 10 and 3). The remaining X1 data points range from 0 to 24, and the remaining X2 data points range from 0 to 12. These points fall inside the observed range, yet they are still identifiable as outliers because they deviate from the overall trend/pattern. Such outliers are called within-range outliers.

Detection & Removal for Linear Regression:

Linear regression involves fitting a line that minimizes the sum of squared differences between observed values (y) and predicted values (ŷ) based on the regression equation. The presence of outliers can negatively impact the performance of the regression model, reducing its explainability (as measured by the R² score).

Common methods for detecting outliers, such as the Z-Score and Interquartile Range (IQR), are effective for identifying out-of-range outliers. However, these methods may not be as suitable for within-range outliers since they focus on data points that are significantly distant from the rest.
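To make the contrast concrete, here is a minimal sketch of both methods applied to the out-of-range example from earlier. The data is made up for illustration (values 0–12 plus the point at 30), and the Z-Score cutoff of 2.5 is an assumption; 3 is also common, but small samples often need a lower threshold.

```python
import pandas as pd

# Illustrative data (made up): points in the 0-12 range plus one at 30
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 30])

# Z-Score method: flag points far from the mean in standard-deviation units
z = (x - x.mean()) / x.std()
z_outliers = x[z.abs() > 2.5]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(z_outliers.tolist())    # [30]
print(iqr_outliers.tolist())  # [30]
```

Both methods flag the extreme value 30 here, but neither would flag a point that sits inside the range while breaking the trend.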

Detecting Within-Range Outliers:

Detecting within-range outliers can be challenging, especially when dealing with large datasets and automated regression pipelines. However, it can be accomplished by leveraging the constant slope principle in linear regression.

Consider a linear regression equation, Y = a.X1 + b.X2 + c, where X1 and X2 are independent variables, and Y is the dependent variable. According to this equation, a change in X1 leads to a proportional change in Y, while keeping other variables constant (exactly so when the other terms are negligible relative to a.X1). Consequently, the ratio of Y to X1 should remain roughly constant in linear regression.
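A quick numeric check of this principle, using made-up single-variable data on the line Y = 2·X1 plus one point that sits inside the X1 range but off the line:

```python
# Made-up example: Y = 2 * X1 for the inliers; the last point has
# X1 = 5 (well within the 2-10 range) but Y = 19, breaking the trend
x1 = [2, 4, 6, 8, 10, 5]
y = [4, 8, 12, 16, 20, 19]

# The per-point slope Y/X1 exposes the deviating point
ratios = [yi / xi for yi, xi in zip(y, x1)]
print(ratios)  # [2.0, 2.0, 2.0, 2.0, 2.0, 3.8]
```

The inliers all share the ratio 2.0, while the within-range outlier stands out with a ratio of 3.8, even though both its X1 and Y values fall inside the observed ranges.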

To detect within-range outliers, create a new variable representing the ratio of Y to X1. Then, apply the Z-Score or IQR method to identify outliers. Outliers will exhibit significantly different slopes compared to the rest of the data points, indicating a deviation from the expected pattern. This detection approach can be seen as analogous to finding Cosine Similarity.

A spreadsheet example with formulae is below. Outliers bolded.

We calculate the Y/X1 and Y/X2 ratios and then apply the IQR method on top of them to identify the outliers. Q1 is the 25th percentile and Q3 is the 75th percentile; L Bound and U Bound are the lower and upper bounds respectively, and anything outside this range is considered an outlier.

Implementation in Python:

For automating within-range outlier detection and removal, the following Python code snippet can be utilized:

# We have a dataframe 'df' with Y and X1 as columns

# Create the Y/X1 ratio as a new column
df["Y/X1"] = df["Y"] / df["X1"]

# Calculate Q1, Q3 and IQR (Interquartile Range)
Q1 = df["Y/X1"].quantile(0.25)
Q3 = df["Y/X1"].quantile(0.75)
IQR = Q3 - Q1

# Obtain df_New by keeping only the data points within the
# Lower Bound (Q1 - 1.5 * IQR) and Upper Bound (Q3 + 1.5 * IQR) range
df_New = df[~((df["Y/X1"] < (Q1 - 1.5 * IQR)) | (df["Y/X1"] > (Q3 + 1.5 * IQR)))]

By applying this code, the resulting df_New dataframe will be free of within-range outliers and ready for fitting a linear regression model.
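An end-to-end run of the snippet on a small made-up dataset, where the inliers lie exactly on Y = 2·X1 and one within-range point breaks the trend (because the inlier ratios are identical here, the IQR collapses to 0 and the bounds reduce to the common ratio; real data would show some spread):

```python
import pandas as pd

# Illustrative data (made up): Y = 2 * X1, plus one within-range outlier
# at X1 = 5, Y = 19
df = pd.DataFrame({"X1": [2, 4, 6, 8, 10, 5],
                   "Y": [4, 8, 12, 16, 20, 19]})

# Apply the detection-and-removal steps from the snippet above
df["Y/X1"] = df["Y"] / df["X1"]
Q1 = df["Y/X1"].quantile(0.25)
Q3 = df["Y/X1"].quantile(0.75)
IQR = Q3 - Q1
df_New = df[~((df["Y/X1"] < (Q1 - 1.5 * IQR)) | (df["Y/X1"] > (Q3 + 1.5 * IQR)))]

print(len(df_New))  # 5 -- the (X1=5, Y=19) row has been dropped
```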

Conclusion:

Detecting and addressing within-range outliers is crucial for maintaining the accuracy and interpretability of linear regression models. By leveraging the constant slope principle, we can identify data points that deviate significantly from the expected pattern. The presented method, using the Y/X1 ratio and applying the IQR method, enables the automated detection and removal of within-range outliers.

