The CNN that started it all

Edward Wang · Published in DataDrivenInvestor · Mar 18, 2021

AlexNet, created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is perhaps the most influential computer vision architecture to date, and the one that sparked modern-day deep learning.

So how does AlexNet work? What is its architecture? And what made it perform so much better than all the other models?

AlexNet Architecture

AlexNet has a deceptively complicated-looking architecture, so let's break it down.

AlexNet Architecture visualized

The architecture is as follows (a rough PyTorch sketch of this stack appears after the list):

  1. Convolutional layer
  2. Max pooling layer
  3. Convolutional layer
  4. Max pooling layer
  5. Convolutional layer
  6. Convolutional layer
  7. Convolutional layer
  8. Max pooling layer
  9. Fully connected layer
  10. Fully connected layer
  11. Fully connected layer
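To make this concrete, here is a minimal PyTorch sketch of an AlexNet-style stack. It follows the layer sizes of the original paper but leaves out details such as local response normalization and the original two-GPU split, so treat it as an illustration rather than a reference implementation.

```python
import torch
import torch.nn as nn

# An AlexNet-style layer stack (illustrative sketch, not the exact reference model).
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),   # convolutional
    nn.MaxPool2d(kernel_size=3, stride=2),                              # max pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),            # convolutional
    nn.MaxPool2d(kernel_size=3, stride=2),                              # max pooling
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),           # convolutional
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),           # convolutional
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),           # convolutional
    nn.MaxPool2d(kernel_size=3, stride=2),                              # max pooling
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),           # fully connected
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),                  # fully connected
    nn.Linear(4096, 1000),                                              # fully connected
)

# A 224x224 RGB image in, 1000 ImageNet class scores out.
print(alexnet(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```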

Convolutional Layers

As we can see, convolutional layers are the most common in AlexNet, with five of them present. But what do these convolutional layers do?

At a high level, a convolutional layer takes an input (the image, or part of the image), transforms it, and passes the result to the next layer. Each convolutional layer contains filters that detect patterns such as edges, shapes, and textures. The deeper you go into the network, the more complex the patterns become, until the filters respond to entire objects such as cats, dogs, and chairs.

The blue square is the input, while the turquoise square is the output
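As a minimal sketch of the mechanics, here is a single convolutional filter sliding over a tiny image. The filter weights are hand-set to a vertical-edge detector purely for illustration; in a real network they are learned during training.

```python
import torch
import torch.nn as nn

# One 3x3 filter applied to a 1-channel "image".
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    # Hand-set vertical-edge detector (for illustration only).
    conv.weight[:] = torch.tensor([[-1., 0., 1.],
                                   [-1., 0., 1.],
                                   [-1., 0., 1.]])

# A tiny image: dark on the left, bright on the right (a vertical edge).
image = torch.zeros(1, 1, 5, 5)
image[..., 2:] = 1.0

print(conv(image))  # large responses where the filter passes over the edge
```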

Max Pooling

You may have noticed that the max-pooling layers always come after a convolutional layer. This is no coincidence. The function of max-pooling is to reduce the spatial size of our input.

20, 30, 112, and 37 are the largest numbers in their quadrant

In max pooling, we do this by keeping only the largest pixel value in each region. In the example above, the output contains only the largest number from each section.
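Here is a small sketch of that exact operation: 2x2 max pooling with stride 2 applied to a 4x4 input whose quadrant maxima are the numbers from the figure above (the remaining values are made up for illustration).

```python
import torch
import torch.nn as nn

# A 4x4 input whose four quadrants have maxima 20, 30, 112, and 37.
x = torch.tensor([[[[ 12.,  20.,  30.,   0.],
                    [  8.,  12.,   2.,   0.],
                    [ 34.,  70.,  37.,   4.],
                    [112., 100.,  25.,  12.]]]])

# 2x2 max pooling with stride 2: keep only the largest value per quadrant.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))
# tensor([[[[ 20.,  30.],
#           [112.,  37.]]]])
```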

By keeping only an abstract, reduced summary of what is going on, we help reduce overfitting. Overfitting is a problem that arises when a machine learning model memorizes the training images instead of generalizing from them. Max pooling also makes the network less computationally expensive, since later layers have fewer inputs to process.

Fully Connected Layer

Last but not least, we have the fully connected layers. These layers are ultimately what classifies our input into the various classes. Using convolutional layers and max pooling we've extracted the necessary features and simplified them; now the fully connected layers classify them.
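A minimal sketch of that hand-off: the feature maps from the last pooling stage are flattened into a vector and passed through fully connected layers to produce one score per class. The sizes below are AlexNet's, and 1000 is the number of ImageNet classes.

```python
import torch
import torch.nn as nn

# Flatten the pooled feature maps and classify them.
# 256 channels of 6x6 features is what AlexNet's last pooling stage produces.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),          # one score per ImageNet class
)

features = torch.randn(1, 256, 6, 6)   # stand-in for the conv/pool output
print(classifier(features).shape)      # torch.Size([1, 1000])
```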

As a whole, the AlexNet architecture is extremely robust, but what sets it apart from all other architectures? Let's explore some of its key features.

ReLU Nonlinearity as an Activation Function

When dealing with neural networks, quality plays an important role, but so does speed. As a result, AlexNet used the ReLU nonlinearity, which trains much faster than the tanh function.

As we can see in the diagram above, for more extreme values the slope of the tanh function is almost 0, which slows down gradient descent. The slope of ReLU, however, stays constant for positive inputs, which makes training much faster than with tanh. The same principle applies to the sigmoid function, which behaves similarly to tanh.
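A quick sketch of why this matters for gradient descent: at a large input value, tanh has saturated and its gradient is nearly zero, while ReLU's gradient is exactly one.

```python
import torch

# Gradient of tanh at a large input: the function has saturated,
# so the gradient is nearly zero and learning stalls.
x = torch.tensor(5.0, requires_grad=True)
torch.tanh(x).backward()
print(x.grad)   # ~0.0002

# Gradient of ReLU at the same input: constant 1 for positive inputs,
# so the gradient signal passes through undiminished.
x = torch.tensor(5.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)   # 1.0
```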

Data Augmentation

Another method AlexNet used to reduce overfitting was data augmentation, which the authors approached in a few ways. The first is mirroring.

When the image is mirrored the label remains the same, and as we can see here it is still a picture of a dog. This teaches the network to generalize better and to understand that the concept of a dog isn't tied to specific pixels at specific locations; rather, it's a collection of broader patterns.

Another way AlexNet was taught that concepts aren't the result of specific pixels was random cropping.

Each image is obviously still a cat, just cropped and shifted. This once again teaches the network to generalize instead of memorizing specific pixel values.
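A minimal sketch of these two augmentations using torchvision's transforms. The filename "cat.jpg" is just a placeholder, and the crop and flip parameters here are illustrative rather than the paper's exact scheme.

```python
import torch
from PIL import Image
from torchvision import transforms

# Placeholder image path; any RGB image would do.
image = Image.open("cat.jpg")

# Random cropping plus random mirroring, in the spirit of AlexNet's augmentation.
augment = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),              # crop/shift the image
    transforms.RandomHorizontalFlip(p=0.5),  # mirror it half the time
    transforms.ToTensor(),
])

# Every call produces a slightly different version of the same cat.
augmented = augment(image)
print(augmented.shape)  # torch.Size([3, 224, 224]) for an RGB image
```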

Dropout

Dropout gives each neuron a chance of being dropped out during training. When a neuron is dropped, it no longer contributes to the model at all. This means that each time the model trains, you get a slightly different architecture. As a result, the model becomes more robust and overfits much less.
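A small sketch of dropout in PyTorch. Note that PyTorch implements "inverted" dropout: surviving activations are scaled up by 1/(1 - p) during training so that nothing needs rescaling at test time, which has the same effect as the paper's test-time scaling.

```python
import torch
import torch.nn as nn

# AlexNet used dropout with p = 0.5 in its first two fully connected layers.
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

dropout.train()        # training mode: neurons are randomly dropped
print(dropout(x))      # about half the entries are 0, the rest are 2.0
                       # (survivors scaled by 1 / (1 - p))

dropout.eval()         # evaluation mode: dropout is switched off
print(dropout(x))      # all ones; the full network is used at test time
```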

Think about it this way: if your goal were to become fit but not muscular, what would you do? You might go to the gym and train a different part of your body every day, maybe the arms one day and the legs another. Each time you go you train a different part, slowly but surely getting the results you want. However, if you were to go in and constantly train your entire body, over time you would end up with large muscles (overfitting). Certain muscles might also be undertrained (an imbalanced model).

This is the same intuition behind dropout. By constantly cycling through different subsets of neurons, each neuron becomes equally strong, overfitting is reduced, and you still get the results you want.

The Future

By understanding and analyzing AlexNet we also get a good glimpse of current CNNs. Data augmentation, ReLU nonlinearity, and even variations of the AlexNet architecture are still widely used to this day. By understanding the intuition behind its different components, we also free ourselves from the idea that “that’s the way it’s always been done.”

This is the best reason to learn history: not in order to predict the future, but to free yourself of the past and imagine alternative destinies. Of course this is not total freedom — we cannot avoid being shaped by the past. But some freedom is better than none.

Yuval Noah Harari, Homo Deus

Key Takeaways

  • Convolutional layers, with the help of learned filters, detect increasingly complex patterns in image data
  • Max pooling is used to reduce the spatial size of the input
  • Fully connected layers take the features extracted by the convolutional layers and classify the image
  • The ReLU nonlinearity is a faster activation function than tanh
  • Data augmentation through mirroring or cropping helps models generalize
  • Dropout makes models more robust, trains neurons more evenly, and reduces overfitting

