How to Train an Artificial Brain

Dickson Wu
Published in DataDrivenInvestor
8 min read · Sep 15, 2020


What’s the most flexible thing in this universe? A slinky? An acrobat? Stretch Armstrong? Wrong! It’s actually something you’re using right now — the 🧠!

Just think about it: we humans can do crazy things! We can calculate the motion of heavenly bodies, diagnose diseases, and invent new technologies, all with the human mind. In fact, the hardware in your skull can do anything that any other human has done before!

So let’s transfer that same power to an artificial system. That’s where Neural Networks (NN) come in! They have comparable power and flexibility to ours! NNs can diagnose cancer, beat fighter pilots at dog-fights, and make beautiful music!

But how do brains and NNs reach the point where they can master such specialized tasks? Through training of course! 💪

Note: To follow along with this article, it’s 🔑 to have a solid understanding of how NNs work! Click here for my article on how NNs work!

Hold on, what actually is training?

When we train our brains, we learn how to do calculus or how to flip a pancake. From the outside, we look the same! But on the inside, our brain has rewired neurons so we can do those things.

In the same way, NNs rewire themselves during training by increasing or decreasing the values of their parameters.

A neuron has thousands of connections that it has to adjust

The Process to Train Parameters:

When training our parameters, we want everything to be automated for us. We can’t manually fiddle around with the parameters (that would take forever). Instead, we get the computer to do it for us, with a Loss Function and Stochastic Gradient Descent (SGD).

Loss Function:

An important part of any training is feedback! You need a coach to tell you what you got right and what you got wrong. For NNs, the coach is the loss function!

The loss function evaluates the performance of the NN. It takes in the predictions of the NN and the target, then computes a loss. There are many types of losses but for all loss functions: the lower the loss, the better!
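To make this concrete, here’s a minimal sketch of one possible loss function, mean squared error, in plain Python (a toy example I’m adding for illustration, not code from any particular library):

```python
def mse_loss(predictions, targets):
    """Mean squared error: the average of (prediction - target)^2."""
    total = 0.0
    for pred, target in zip(predictions, targets):
        total += (pred - target) ** 2
    return total / len(predictions)

# The closer the predictions are to the targets, the lower the loss.
print(mse_loss([0.9, 0.1], [1.0, 0.0]))  # 0.01 -> good
print(mse_loss([0.4, 0.6], [1.0, 0.0]))  # 0.36 -> worse
```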

The Intuition Behind SGD:

Imagine you're playing a game:

  • You’re in a nice hilly landscape 🌄
  • You have blindfolds on
  • You are placed at a random location
  • Goal: To get to the lowest point in the landscape

Which direction are you going to go? You want to get to the bottom as quickly as possible, so you feel for the steepest downhill slope! Take a step in that direction, re-find the slope, then repeat! Keep going and you’ll eventually find the bottom! 🙌

Replace the hilly landscape with the loss function, and you with the parameters, and you’ve got SGD! The parameters start off at some random place on the loss field. Then they feel around, find the steepest downhill direction, and head that way. They keep going until they find the bottom of the loss field!

Training = Get parameters to the bottom! Training is over when you’ve got parameters that are acceptably close to the bottom.

🔑 Don’t be intimidated by these big fancy names. 99% of the time they can be simplified down! (SGD → The way to find the bottom of the hill!)

The Loss Field

The Math Behind SGD:

So how are we going to represent SGD mathematically? Steepness = Derivative! To find the steepest part of the loss field you take the partial derivative of the loss function with respect to your parameter. Then you multiply the derivative by the learning rate (which will be explained later) to create the step. You then subtract the step from the parameter to get your updated parameter.

You’ve got to go through this process for each parameter over and over again. But don't worry! Computers are fast and they can calculate the derivative automatically for us!

Formula for SGD: θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ
  • θⱼ = Any parameter
  • α = Learning rate
  • J(θ) = Loss function
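Here’s a tiny sketch of that formula in plain Python, using a made-up toy loss J(θ) = θ² so the derivative is just 2θ (a real NN has thousands of parameters and the derivatives are computed automatically, but the update rule is the same):

```python
def dJ_dtheta(theta):
    # Derivative of the toy loss J(theta) = theta^2
    return 2 * theta

theta = 5.0          # the parameter starts at a random spot on the loss field
learning_rate = 0.1  # alpha

for step in range(20):
    gradient = dJ_dtheta(theta)               # steepness at the current spot
    theta = theta - learning_rate * gradient  # theta := theta - alpha * dJ/dtheta

print(theta)  # ~0.06, close to 0 (the bottom of the hill)
```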

Learning Rates:

Learning rates are the difference between training your model in 5 epochs versus 100 epochs! 1 epoch is a whole cycle through your training data. Be attentive when choosing them! 👀

Sticking with the analogy, learning rates are how big your steps are. They could be giant strides (high learning rate) or tiny little baby steps (small learning rate). On the math side, the learning rate is what you multiply the derivative by in order to get your step. 👟

Choosing the learning rate is the tricky part. But when it comes down to it, just experiment and see which learning rate works best!

Down below are some nice GIFs I made myself to show you what happens if you choose learning rates that are: perfect, too low, too high, and way too high!

Note on GIFs: The green parabola represents the hill/loss field. The little jumps represent the steps that we’re taking. The flag represents the bottom of the loss function.

Perfect Learning Rate:

This converges quickly to the minimum! The step size naturally decreases because the derivative at each point gets closer and closer to 0. 👍

The perfect learning rate

Too Low Learning Rate:

When the learning rate is too low it takes forever to get to the bottom. This is because we only take tiny little steps that hardly budge us at all. As a result, you have to train for more epochs, which increases the risk of overfitting. 👎

The Learning rate is too low

Too High Learning Rate:

When the learning rate is too high it also takes forever to get to the bottom. This is because the steps constantly overshoot the minimum. This takes too many epochs, which runs the risk of overfitting. 👎

The Learning Rate is too high

Way too High Learning Rate:

When our learning rate is way too high, we diverge and never get to the answer. You can tell this is happening when your losses exponentially increase! 👎 👎 👎

The Learning Rate is way too high
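If you want to play with these behaviours yourself, here’s a rough sketch that reruns the toy parabola loss from the SGD example above with a few different learning rates (the exact values are just ones I picked to show the four cases):

```python
def run_sgd(learning_rate, steps=25, start=5.0):
    """Run SGD on the toy loss J(theta) = theta^2 (derivative = 2 * theta)."""
    theta = start
    for _ in range(steps):
        theta = theta - learning_rate * 2 * theta
    return theta

for lr in [0.1, 0.001, 0.99, 1.1]:
    print(f"lr={lr}: theta ends at {run_sgd(lr):.3f}")

# lr=0.1   -> ends near 0: converges nicely (perfect)
# lr=0.001 -> barely moves from 5.0 (too low)
# lr=0.99  -> keeps overshooting past 0 and is still far away (too high)
# lr=1.1   -> overshoots further every step and blows up (way too high)
```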

Metric vs Loss Function:

When you’re training a NN with PyTorch or TensorFlow, these two columns (a metric and a loss) always appear in the training output. So what are they? They’re similar in that they both provide feedback on how the NN is doing, but they serve two different purposes:

Metric:

  • A measure for a human to see and check the NN’s performance
  • Picked intuitively
  • Can either increase or decrease depending on the type of metric

Loss Function:

  • A measure used by computers for SGD
  • Picked for its similarity to our goal, and because it can be differentiated
  • Should always decrease

Wait so why can’t they be the same thing? Well, they can, but more often than not, they aren’t. An example will help clarify why. Let’s say our task is to classify images of numbers into digits. The metric would be the accuracy of our NN (#Correct/Total).

If we used this as our loss function, we would have some serious problems. First of all, it’s not even differentiable. Even if it were, it would still be horrible. This is because if we change all the parameters by 0.001%, the predicted digits would almost always stay the same, so the accuracy wouldn’t change. When the output doesn’t change, we have a big problem! Our derivative is 0.

This means we’re stuck. The step would always be equal to 0. We can’t move, and there’s no way out. Even if we restart and increase the learning rate, we still have a huge chance of getting stuck! 😖

So we don’t use accuracy for the loss function. Instead, we’ll use something like Binary Cross-Entropy Loss. It’s differentiable and tiny changes in the parameters will change the loss. 😄
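Here’s a toy sketch of that difference in plain Python (the numbers are made up, and I’m using a simplified binary setup with just three predictions): a tiny nudge to the predictions leaves accuracy untouched, but the binary cross-entropy loss does move, so SGD still gets a signal to follow.

```python
import math

def accuracy(predictions, targets):
    # Fraction of predictions that land on the correct side of 0.5
    correct = sum((p > 0.5) == (t == 1) for p, t in zip(predictions, targets))
    return correct / len(targets)

def bce_loss(predictions, targets):
    # Binary cross-entropy: -[t*log(p) + (1-t)*log(1-p)], averaged
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(predictions, targets)) / len(targets)

targets = [1, 0, 1]
before  = [0.70, 0.40, 0.30]  # predictions before a tiny parameter nudge
after   = [0.71, 0.39, 0.31]  # predictions after the nudge

print(accuracy(before, targets), accuracy(after, targets))  # 0.667, 0.667 -> no change, derivative is 0
print(bce_loss(before, targets), bce_loss(after, targets))  # 0.690, 0.669 -> the loss moves, SGD can follow it
```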

MNIST Dataset, The Hello World Of AI

Batching:

Alrighty, final topic! So far, we’ve been updating the parameters every single time we run one sample through the NN and compare the prediction with the target. But is this the optimal way of doing it? Actually, no! It’s computationally expensive and gives bad results! 👎

First of all, if you have 7000 training samples, the program has to run 7000 times for a single epoch! Each run updates thousands of parameters, and each of those parameters needs dozens of operations to find its derivative!

Computers are fast, but that’ll take a long time. Plus, training on only one sample at a time causes a bumpy loss field. That’s because each individual sample is either 100% correct or 0% correct. This makes the NN train less accurately and leads to bad results. 😖

So how can we fix it? We could batch multiple samples together and train the NN on the whole batch before updating the parameters. This solves the speed problem because computers can run through the whole batch in parallel! Additionally, the loss field is less bumpy because the loss is all averaged out (it’s no longer 100% right or 0% right all the time). 😄
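In PyTorch, for example, batching is usually handled by a DataLoader. Here’s a rough sketch of what a batched training loop might look like (the dataset, model, and sizes are stand-ins I made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 7000 samples with 10 features each, and binary targets
x = torch.randn(7000, 10)
y = torch.randint(0, 2, (7000, 1)).float()
dataset = TensorDataset(x, y)

# ~110 batches of 64 instead of 7000 single-sample updates per epoch
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = torch.nn.Sequential(torch.nn.Linear(10, 1), torch.nn.Sigmoid())
loss_fn = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for xb, yb in loader:
    predictions = model(xb)          # run the whole batch through the NN in parallel
    loss = loss_fn(predictions, yb)  # the loss is averaged over the batch -> smoother loss field
    optimizer.zero_grad()
    loss.backward()                  # compute the derivative for every parameter automatically
    optimizer.step()                 # one SGD update per batch, not per sample
```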

🔑 Takeaways:

  • NNs train by adjusting the parameters 🔼 and 🔽
  • Training is automatic thanks to the Loss Function and SGD
  • The Loss Function is like a hilly field, where SGD is the process to get to the bottom
  • Picking the right learning rate is important
  • Batching speeds up training and increases accuracy

Thanks for reading! I’m Dickson, a 17-year-old tech enthusiast who’s excited to accelerate myself to impact billions of people 🌎

If you want to follow along on my journey, you can join my monthly newsletter, check out my website, and connect on LinkedIn or Twitter 😃

