Math behind Data Science — Vol 3

Linear Regression and Modeling.

Oleksii Kharkovyna
DataDrivenInvestor


Image by Andra Ion from Unsplash

Linear regression…

If you plan to step into data science, you will hear these two words many times. A lot of projects require this statistical method because it offers a powerful way to analyze data and make predictions.

That’s why I have put extra effort into this post to explain linear regression clearly. As always, I will share some brief explanations, my insights, and useful links on how to go deeper and learn it faster.

This is the third part of a series of posts where I translate complex math behind data science into simple definitions. In two previous posts, I uncovered Functions, Multivariable Calculus & Graphs and Statistics and Probability.

Now, let’s talk science!

Why Do Data Scientists Need Linear Regression?

Image by Akharis Ahmad from Unsplash

Let’s first figure out why we actually need LR.

Linear regression, aka the general linear model, is used in both data science and machine learning. Whether it is a business trying to forecast its profits or scientists trying to uncover the laws governing the universe, all such projects can call for linear regression.

Why Regression?

Analysis and prediction: that’s the short answer. Just like any other statistical tool used in data science, regression helps to extract insights from the data.

The major goal of LR is to analyze the relationships between variables. The second goal is building a model that can help you predict an output (the dependent variable) based on a given set of inputs (independent variables).

So, LR helps to analyze different scenarios and predict possible outcomes.

Why Linear Regression?

Apart from linear, there are other types of regression too, like logistic and multivariate models. Each of them is equally important and focused on building an equation that helps us learn more about data. The only difference lies in the cases where they are used, because it all depends on the type of problem, data, and distribution. If you’re interested, I’ll write a separate article on this.

But now, let’s understand linear regression.

Variables in Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data. Here are the types of variables:

  • dependent (response) variable: the outcome we want to explain or predict;
  • independent (explanatory) variable: the input that influences the outcome.

A dependent variable is a variable whose values we want to explain or forecast. It can also be called the predicted or criterion variable, since it is the one we predict. It is mostly referred to as Y.

The independent or explanatory variable is a variable that explains or predicts the other. It can also be called the predictor variable and is referred to as X.

Depending on the case, there can be simple regression with a single input variable (X), or multiple regression with multiple input variables.

Residuals and Coefficients / Weights

There are two more important concepts used in LR.

Residuals: Sometimes called “error.” This is the difference between the predicted and the actual response. The more (relevant) independent variables you add, the more (hopefully) this difference decreases, indicating a model with greater accuracy and predictive power.

Coefficients/weights: these represent the strength and direction of the relationship between your input (explanatory) and output (response) variables.
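As a quick illustration with made-up numbers, a residual is simply the actual value minus the predicted one:

```python
import numpy as np

# Made-up actual and predicted responses for five observations
actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
predicted = np.array([2.8, 5.3, 6.9, 9.4, 10.8])

# Residuals: actual minus predicted response
residuals = actual - predicted
print(residuals)  # [ 0.2 -0.3  0.1 -0.4  0.2]
```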

Putting it all together

Variables, residuals, and coefficients: everything described above is needed to build the equation that runs a linear regression.

The formula may look like this:

Y = β₀ + β₁x₁ + β₂x₂ + β₃x₃

In this case, we have three explanatory variables (x₁, x₂, and x₃); this is our input, the data we have. Y is the dependent variable, that is, our predicted output, or response.

β₀ is the y-intercept, the value at which the fitted line crosses the y-axis.

On the basis of the equation, we can build a plot:

Orange points show the observed data (explanatory variable on the x-axis, response on the y-axis); the blue line is the model’s prediction for those same points.
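As a minimal sketch with made-up weights, computing a prediction from such an equation is just the intercept plus a weighted sum of the inputs:

```python
import numpy as np

intercept = 1.5                       # hypothetical y-intercept (beta_0)
weights = np.array([0.8, -0.2, 2.1])  # hypothetical weights for x1, x2, x3
inputs = np.array([2.0, 4.0, 1.0])    # one observation of the three inputs

y_hat = intercept + weights @ inputs  # predicted response Y
print(y_hat)                          # about 4.4 (1.5 + 1.6 - 0.8 + 2.1)
```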

Example of LR

Imagine we have a tree seed, and we need water to grow it. The height of the tree will depend on how frequently we water it.

In this case, water is the independent variable (X).

The height of the tree is the dependent variable (Y), as it depends on the amount of water it receives.

Multiple regression

Multiple regression is used to explain the relationship between a dependent variable and more than one independent variable.

Here is an example of a multiple regression plot showing the relationship between possible income (y) and two independent variables (x): seniority and years of education.

Instead of a line, we fit a plane like this to find the best fit.
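A minimal sketch of fitting such a plane with NumPy’s least-squares solver; the income, seniority, and education numbers below are made up:

```python
import numpy as np

# Made-up data: (seniority in years, education in years) -> income
X = np.array([[2, 12], [5, 16], [10, 16], [15, 18], [20, 20]], dtype=float)
y = np.array([30.0, 55.0, 70.0, 95.0, 120.0])

# Prepend a column of ones so the solver also estimates the intercept
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, b_seniority, b_education = coeffs
print(intercept, b_seniority, b_education)
```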

Cost Equation

After you have predicted values with linear regression, you can compare them with the actual values you observed. This is needed to improve the accuracy of the model.

To measure the difference between an observation’s actual and predicted values, we can use the cost equation, or error equation:
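A common choice, which I’ll assume here, is the mean squared error, MSE = (1/n) Σ (yᵢ − ŷᵢ)², i.e. the average of the squared residuals. A minimal sketch:

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error: the average of the squared residuals."""
    residuals = np.asarray(actual) - np.asarray(predicted)
    return np.mean(residuals ** 2)

# Made-up values for illustration
print(mse([3.0, 5.0, 7.0], [2.8, 5.3, 6.9]))  # about 0.047
```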

The cost equation is sometimes used as a measure of how well different iterations of our model perform — the lower the cost / error, the better your model is.

Optimization aka Gradient descent

Image by Claudio Schwarz from Unsplash

The next important concept within linear regression is optimization. Just like the cost equation, this method is needed to boost the accuracy of the model. More specifically, to find the best / most accurate linear regression model.

Optimization is done by finding the lowest cost/error. Gradient descent minimizes a function by iteratively moving in the direction of the steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.

Gradient

This is the vector of first-order partial derivatives of the cost function, one with respect to each parameter (each weight and the y-intercept).

Source of the image — Gradient Descent Example

Learning Rate

The learning rate is a “hyperparameter” in linear regression: a value we can arbitrarily choose that affects how the algorithm works. In many machine learning algorithms, the best choice for a hyperparameter depends on many complex factors and there may be no simple answer, and yet often there is a recommended value as a starting point. In many libraries, a default will be chosen for you. The learning rate is one of two components used to update our weights (the other being our gradient).
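Putting the gradient and the learning rate together, here is a minimal sketch of gradient descent for simple linear regression with an MSE cost; the data and the learning rate of 0.05 are made up for illustration:

```python
import numpy as np

# Made-up data that roughly follows y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0        # weight and y-intercept, initialized at zero
learning_rate = 0.05   # the hyperparameter discussed above
n = len(x)

for _ in range(2000):
    y_hat = w * x + b                          # current predictions
    # Partial derivatives of the MSE cost with respect to w and b
    grad_w = (-2.0 / n) * np.sum(x * (y - y_hat))
    grad_b = (-2.0 / n) * np.sum(y - y_hat)
    # Step in the direction of the negative gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should end up near 2 and 1
```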

Underfitting and Overfitting

In simple words, overfitting is good performance on the training data but poor generalization to other data. Underfitting, on the other hand, is poor performance on the training data and poor generalization to other data.

Underfitting occurs when a model can’t accurately capture the dependencies among data, usually as a consequence of its own simplicity. It often yields a low 𝑅² with known data and bad generalization capabilities when applied to new data.

Overfitting happens when a model learns both dependencies among data and random fluctuations. Complex models, which have many features or terms, are often prone to overfitting. When applied to known data, such models usually yield high 𝑅². However, they often don’t generalize well and have significantly lower 𝑅² when used with new data.
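To see this pattern, here is a sketch (using scikit-learn, introduced below, and made-up noisy data): an overly flexible model typically scores a higher R² on the data it was fit to than on fresh data, while a simple one scores similarly on both.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 3, size=(20, 1))
y_train = 2 * X_train.ravel() + rng.normal(0, 1, 20)  # linear trend + noise
X_new = rng.uniform(0, 3, size=(20, 1))
y_new = 2 * X_new.ravel() + rng.normal(0, 1, 20)

for degree in (1, 9):  # degree 1 fits the trend; degree 9 chases the noise
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          r2_score(y_train, model.predict(X_train)),  # R^2 on known data
          r2_score(y_new, model.predict(X_new)))      # R^2 on new data
```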

How to Implement LR in Python?

The last thing I want to share in this article is advice on how to implement LR in Python.

For this, you need the proper packages with their functions and classes. I would suggest the Python libraries statsmodels and scikit-learn. I will explain why.

Statsmodels provides convenient access to a number of different model attributes and also has a helpful model.summary() method that reports many useful statistics, such as R-squared and the F-statistic.
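A minimal sketch with statsmodels, using made-up data; note that sm.add_constant adds the intercept term explicitly:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = sm.add_constant(x)       # add the intercept column
model = sm.OLS(y, X).fit()   # fit ordinary least squares
print(model.summary())       # reports R-squared, F-statistic, coefficients
```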

The package scikit-learn is open source and provides the means for preprocessing data, reducing dimensionality, implementing regression, classification, clustering, and more.
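And an equivalent sketch with scikit-learn’s LinearRegression, again on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data; scikit-learn expects a 2-D feature matrix
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # fitted y-intercept and weight
print(model.predict([[6.0]]))         # prediction for a new input
```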

You can check the page Generalized Linear Models on the scikit-learn web site to learn more about linear models and get a deeper insight into how this package works.

Final words

I hope this article was enough for you to grasp the meaning of linear regression in data science. If you still have questions or want me to cover a specific part of this topic, leave a comment.

To learn more about LR, I suggest these links that might help you to go deeper:

Read more of my articles on AI, ML & Data Science on Medium, or Linkedin.

Good luck and cheers!
