The Math and Intuition Behind Gradient Descent

Suraj Bansal
Published in DataDrivenInvestor · 10 min read · Dec 1, 2019


Agile software development defines an iterative product development process in which the following steps are exercised.

1) Construct products after conducting market research
2) Commercialize product and enter the marketplace
3) Measure consumer satisfaction and market penetration
4) Respond to feedback and iterate product
5) REPEAT 💱 💱 💱

This process essentially tests the market, collects feedback and iterates the product until you’ve reached maximum market penetration with minimal error. The cycle repeats multiple times and ensures that consumers can provide input at each step to influence what changes should be made.

This seemingly simple process of constant iteration is actually reflected in the principles of gradient descent. Gradient descent works by first calculating the gradients of the cost function and then updating the existing parameters in response to those gradients, in order to minimize the cost function.

Gradients convert functions of numerous variables into a single vector, but we’ll discuss that later 😉

WOAHHHHHHHHH 😵😵😵 hold up now- that looks super complex

You would actually be surprised- but don’t worry.

Understanding the multivariate calculus behind gradient descent can be extremely daunting- I’m gonna explain the intuition behind gradient descent and only cover the math concepts that are requisite to your comprehension. I’d highly recommend visiting my article or video on Machine Learning to review the basics first!

GRADIENT DESCENT VARIANTS- THERE’S MORE THAN 1

There are three primary variants implemented with machine learning algorithms- each differs in computational efficiency and offers its own unique benefits.

NUMBER 1️⃣

Perhaps the simplest type is batch gradient descent. One full pass through the training set is called a training epoch- the unit that counts how many times the training vectors have been used to update the model’s weights.

In batch gradient descent, the error is calculated for each individual example of the training set, but the model parameters are updated only after all training points have gone through the machine learning algorithm in one epoch.

The error gradient and convergence rate are stable with this method, and an adequate level of computational efficiency is achieved. However, since the model only updates the weights after the entire training set has been analyzed, the state it converges to may not be the most optimal- the model could potentially achieve one that is more accurate!
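As a rough sketch (the data, learning rate and epoch count here are made up purely for illustration), batch gradient descent on a toy linear model might look like this- note the single parameter update per epoch:

```python
import numpy as np

# Minimal sketch of batch gradient descent on a toy linear-regression problem.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0               # noiseless targets: true slope 3, intercept 2

theta0, theta1 = 0.0, 0.0       # model parameters (intercept, slope)
lr = 0.01                       # learning rate (alpha)

for epoch in range(5000):
    error = (theta0 + theta1 * x) - y
    # gradients averaged over the ENTIRE training set -> one update per epoch
    theta0 -= lr * error.mean()
    theta1 -= lr * (error * x).mean()
```

After training, theta0 and theta1 should sit very close to the true 2.0 and 3.0.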

NUMBER 2️⃣

Enter… stochastic gradient descent! The fundamental difference between the two methods is that stochastic gradient descent shuffles the entire dataset and updates the weights and parameters with respect to each individual training example, whereas the batch method updates the parameters only after the entire training set has been analyzed.

The frequent updates to the model provide more granular rates of improvement and much faster computation. However, they also produce noisier gradients- the model oscillates within the general area of the minimum (the point where the cost function is lowest) rather than settling on it exactly. Therefore, some variance will exist between test runs.
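The same toy problem, sketched with stochastic updates instead (again, all values are illustrative)- the dataset order is randomized and the parameters change after every single example:

```python
import numpy as np

# Rough sketch of stochastic gradient descent on a toy linear model.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0                       # noiseless targets: slope 3, intercept 2

theta0, theta1 = 0.0, 0.0
lr = 0.005                              # learning rate (alpha)

for epoch in range(50):
    for i in rng.permutation(len(x)):   # randomize the order each epoch
        error = (theta0 + theta1 * x[i]) - y[i]
        theta0 -= lr * error            # one update per individual example
        theta1 -= lr * error * x[i]
```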

Okay, there are some obvious benefits and drawbacks associated with both methods- which method is better to implement for your machine learning model? Trick question- neither!

THIRD TIME’S THE CHARM 3️⃣ 🏅

Enter… mini-batch gradient descent! This essentially combines the efficiency of batch gradient descent with the overall robustness of stochastic gradient descent.

This method works by splitting the dataset into smaller batches, usually between 30 and 500 training points, and performing a model update for each individual batch. This reduces the variance of the parameter updates while still allowing highly optimized matrix operations to improve efficiency and accuracy ⭐️ ⭐️ ⭐️
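Sketched on the same toy problem (batch size, learning rate and data all made up for demonstration), the mini-batch version averages the gradient over each small batch:

```python
import numpy as np

# Rough sketch of mini-batch gradient descent: split the data into batches of
# 32 points and update the parameters once per batch.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=256)
y = 3.0 * x + 2.0

theta0, theta1 = 0.0, 0.0
lr, batch_size = 0.01, 32

for epoch in range(500):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (theta0 + theta1 * x[idx]) - y[idx]
        theta0 -= lr * error.mean()               # averaged over the batch
        theta1 -= lr * (error * x[idx]).mean()
```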

Every gradient descent variant is modelled with the following update formula. The update is executed each time the model undergoes backpropagation, until the cost function reaches its point of convergence.

Here the weight vector exists in the x-y plane, and the gradient of the loss function with respect to each weight is multiplied by the learning rate and subtracted from the current weight vector.

The partial derivatives are the gradients used to update the values of θo & θ1, and the alpha represents the learning rate- the hyperparameter that the user must specify. m represents the number of training examples, and i indexes them, running from the first example up to the m-th.
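Written out explicitly (assuming the simple linear hypothesis hθ(x) = θ0 + θ1x that the θ0 and θ1 above imply), the standard form of this update rule is:

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```

where x₀⁽ⁱ⁾ = 1, so that j = 0 updates the intercept and j = 1 updates the slope.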

SOME QUICK MATHS

PARTIAL DERIVATIVES

We know that functions with multidimensional inputs have partial derivatives: for a multivariable function, we take the derivative with respect to one variable while treating the others as constants. But what about the entire derivative of said function 🤔

Let’s understand the math behind partial derivatives first. Differentiating a multivariable function like f(x,y)=x²y can be broken down like this:
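Working that example out, with f(x, y) = x²y:

```latex
\frac{\partial f}{\partial x} = 2xy \quad (\text{treat } y \text{ as a constant}), \qquad
\frac{\partial f}{\partial y} = x^{2} \quad (\text{treat } x \text{ as a constant})
```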

Okay, I know what you’re thinking- derivatives are already tedious and difficult by themselves; why use partial derivatives instead of the entire derivative?

Well, function inputs are composed of multiple variables- hence the name, multivariate calculus. Partial derivatives assess how each individual variable changes while the others are treated as constants.

GRADIENTS

Gradients take the multidimensional input of a scalar-valued multivariable function and output a single vector. The gradient points in the direction of the function’s greatest rate of increase, and its magnitude gives the slope of the graph’s tangent in that direction- for us, the incline of our cost function.

In essence, the gradient of any given function f, generally denoted ∇f, is the collection of all its partial derivatives interpreted as a vector.

Imagine standing at the point (x₀, y₀, …) in the input space of f. The vector ∇f(x₀, y₀, …) identifies the direction to travel that increases the value of f the quickest. Fun fact 📍 The gradient vectors ∇f(x₀, y₀, …) are also perpendicular to the contour lines of f!

Yeahhhh- multivariate calc can definitely be daunting 😅 Let’s summarize.

A function of n variables has n partial derivatives, each isolating one individual variable while representing the others as constants. The gradient just assembles those partial derivatives into one vector quantity.
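To tie the summary together, here’s a minimal sketch (the helper function is hypothetical, written just for this post) that approximates the gradient of f(x, y) = x²y numerically and matches it against the analytic partials:

```python
# Approximate the gradient of f(x, y) = x**2 * y with central finite
# differences, and compare against the analytic result [2xy, x**2].
def f(v):
    x, y = v
    return x ** 2 * y

def numerical_gradient(func, point, h=1e-5):
    grad = []
    for j in range(len(point)):
        plus, minus = list(point), list(point)
        plus[j] += h
        minus[j] -= h
        # central difference approximates the j-th partial derivative
        grad.append((func(plus) - func(minus)) / (2 * h))
    return grad

g = numerical_gradient(f, [3.0, 2.0])
# analytic gradient at (3, 2): [2 * 3 * 2, 3 ** 2] = [12, 9]
```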

LEARNING RATE

The gradient determined the direction to move; the learning rate determines the size of the step we take. The learning rate is essentially the hyperparameter that defines how much the neural network’s weights are adjusted with respect to the loss gradient.

This parameter determines how fast or slow we move towards the optimal weights while minimizing the cost function at every step. A high learning rate covers more area per step but risks overshooting the minimum; a low learning rate takes virtually forever to reach the minimum.
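A toy illustration of that trade-off (all numbers made up): gradient descent on f(w) = w², whose gradient is 2w and whose minimum sits at w = 0, under three different learning rates:

```python
# Gradient descent on f(w) = w**2 under three learning rates.
def descend(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w        # each step multiplies w by (1 - 2 * lr)
    return w

too_low    = descend(0.001)    # barely moves from the starting point
just_right = descend(0.1)      # converges towards the minimum at 0
too_high   = descend(1.1)      # overshoots further every step and diverges
```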

Either way, this phenomenon can be exemplified through my little nephew, Arnav and his fascination with dogs 🐶 🐶 🐶

Let’s say Arnav’s wildest dreams came true and he sees 25 magnificent labrador retrievers- each of which is black-coloured. Naturally, Arnav recognizes the consistent black colour and associates it as the predominant feature he looks for when identifying dogs.

Let’s say he was suddenly shown a white-coloured dog and I told Arnav that it was indeed a dog. With a low learning rate, Arnav would continue to believe that all dogs must be black and that this one was simply an outlier 🐼

A high learning rate would imply that Arnav suddenly believes all dogs must be white and that any inconsistency with his new hypothesis is incorrect- even though he saw 25 black dogs before.

A desirable learning rate would mean that Arnav realizes colour is not an important attribute for classifying dogs and proceeds to discover other features. This learning rate is infinitely better, as it strikes a balance between accuracy and the time required.

COST FUNCTION

The cost function measures the model’s performance- through the process of training our neural network, we want to keep reducing this cost function until it reaches a minimum.

The cost function essentially quantifies the total error between predicted and expected values through regression metrics like mean absolute error and mean squared error.

MEAN ABSOLUTE ERROR

Mean absolute error measures the average magnitude of error within a large group of predictions without considering their direction. This can be modelled through the following equation.
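With n predictions, where yᵢ is the actual value and ŷᵢ the predicted one, that equation is:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```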

MEAN SQUARED ERROR

Mean squared error finds the average squared difference between predicted and actual values. It follows the same principle as MAE, except the errors are squared rather than taken in absolute value. Rather than a partial error representing the distance between two points in the coordinate system, this metric’s partial error equals the area of the square produced from the distance between the measured points.
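Both metrics are quick to compute by hand- here’s a minimal sketch on hypothetical predicted vs. actual values:

```python
# MAE and MSE on a tiny set of made-up predictions.
actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

n = len(actual)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
# errors are 0.5, 0.0, -2.0, -1.0, so MAE = 3.5 / 4 and MSE = 5.25 / 4
```

Notice how the squaring in MSE makes the single large error (-2.0) dominate the total- that’s why MSE punishes outliers more heavily than MAE.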

INTRO TO GRADIENT DESCENT (AGAIN)

Let’s explore an analogy to further understand the intuitive principles of gradient descent!

Imagine that you’ve been stranded atop the summit of Mount Everest and you’re tasked with making your way to the bottom- sounds relatively straightforward, right?

Well, one small piece of information to consider- you’re completely blind.

That definitely makes the task harder- but not impossible. You’d have to take small steps, always feeling for the direction in which the terrain slopes downward. This approach would need countless iterations until you’ve reached the bottom, but you would get there eventually.

This essentially emulates the concept of gradient descent where your model backpropagates to eventually reach the lowest point of the mountain.

The mountain is akin to the data plotted in space, the step size is akin to the learning rate, and feeling the incline of the surrounding terrain is analogous to the algorithm calculating the gradient of the dataset’s parameter values.

Assuming each selected direction is correct, every step reduces the cost function. The bottom of the mountain represents the optimal values of the machine’s weights (where the cost function has been minimized).

LINEAR REGRESSION

For those who are unfamiliar, regression analysis is used throughout all statistical modelling disciplines to investigate relationships between variables for predictive analysis.

The line that best captures the relationship between expected and experimental values is called the regression line- each residual can be portrayed through a vertical line that connects a data point to the line of best fit.

The following equation shows x as the input training data (uni-variate- one input variable, the parameter) and y as the labels of the data, assuming supervised learning.
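In its simplest univariate form (consistent with the θ0 and θ1 used earlier), that equation is:

```latex
\hat{y} = \theta_0 + \theta_1 x
```

where θ0 is the intercept and θ1 is the slope of the regression line.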

Let’s put this into perspective with the following example.

Elon works part-time as the marketing director at salesx and has collected data on the amount spent versus the sales generated by this past year’s promotional efforts, in order to validate future sales and promotion proposals 💡

Elon concludes that the data should be linear and interprets the information as a scatter plot with new customers and amount spent as its axes. Elon then forms the regression line in order to better understand how many customers salesx would receive with their new marketing ideas.

POLYNOMIAL REGRESSION

Linear regression works beautifully to show structures and trends that exist between two correlated variables in your dataset. However, given the behaviour of linear functions, it cannot accurately capture the regression for nonlinear relationships that still clearly exhibit some correlation.

Polynomial regression models relationships using nth-degree functions and can fit certain datasets with lower error function values than a linear regression line.

Although polynomial regression fits the curvature better and can provide the most accurate representation of the relationship between two variables, it is extremely sensitive to outliers, which can easily skew the fit.
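A quick sketch of the difference (with made-up data): a quadratic relationship that a straight line fits poorly but a degree-2 polynomial fits almost exactly:

```python
import numpy as np

# Compare a linear fit against a degree-2 polynomial fit on quadratic data.
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x ** 2      # noiseless quadratic, for illustration

line = np.polyfit(x, y, deg=1)        # least-squares straight line
quad = np.polyfit(x, y, deg=2)        # least-squares degree-2 polynomial

line_mse = np.mean((np.polyval(line, x) - y) ** 2)
quad_mse = np.mean((np.polyval(quad, x) - y) ** 2)
# quad recovers the coefficients [0.5, 2.0, 1.0] and drives the MSE to ~0
```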

Thanks for checking out my article and I hope you learned more about gradient descent and how this principle powers machine learning and artificial intelligence🙏 Would mean everything to me if you did the following!

  1. Send my article some claps 👏
  2. Connect with me on LinkedIn 👈
  3. Follow me on Medium ✍️
  4. Check out my portfolio for my latest work 💪
  5. Stay updated on my journey and subscribe to my monthly newsletter 🦄
