How to Build a Deep Neural Network from Scratch with Julia

Learn how to build a fully functional deep neural network in an efficient way using Julia.

Xabier Garcia Andrade · Published in DataDrivenInvestor · Sep 1, 2019

Deep Learning is one of the most promising fields of artificial intelligence, with proven success in areas ranging from computer vision to natural language processing. In this article, you will learn how to build a neural network without needing any prior domain knowledge.

Requirements:

  • Linear Algebra
  • Basic Calculus
  • Programming

What is a Neural Network?

A Neural Network (NN) can be defined as a computing system which learns patterns from data, trying to mimic how biological neurons share information. A NN is composed of an Input Layer, L hidden layers and an Output Layer.

Neural Network with a single Hidden Layer

Every layer is made up of neurons. Every neuron has two sets of parameters, weights and biases, whose function will become clearer after discussing our implementation.

When input data comes into a neuron, it computes a linear function using weights and biases as parameters. That is:
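Z = W · X + b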

where X, W and b are the input data, weights and biases respectively. If a NN only performed this calculation, it would be analogous to linear regression. In order to learn non-linear patterns, it is necessary to apply another function after computing Z. This is called the activation function:
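A = g(Z)

where g denotes the chosen activation function (in our case, ReLU or sigmoid).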

This process is repeated in every neuron throughout the network, until it reaches the output layer.

We can compare the output from the NN with the target data Y by defining a cost function. An example of a cost function (to illustrate its purpose) is the mean square error (MSE):
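MSE = (1/m) · Σ_i (y_hat_i − y_i)²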

where y_hat is the output of our NN and m is the number of training examples. The cost function is a measure of how much our prediction differs from reality. The objective of our neural network is to bring the predictions as close to the target data as possible by minimizing the cost function, thereby learning the optimal parameters.

Weights and biases are learned by means of the backpropagation algorithm. In mathematics, minimizing a function entails taking derivatives (gradients) of that function. If the function is a composite of several functions, the chain rule can be used to compute the derivatives with respect to the different parameters.

After computing the derivatives with respect to the parameters, we can update them using the gradient descent algorithm, an iterative optimization algorithm for finding the minimum of a function. Intuitively, gradient descent can be visualized as:

Gradient Descent converging to a local minimum after 4 iterations.

This process is repeated systematically until our NN achieves high accuracy. From this point on, we will refer to the computation of the output as the Forward Step and the computation of the gradients as the Backward Step. In order to consolidate our understanding of NNs, we will build a neural network from scratch in Julia, focusing on a concrete problem.

Example:

Our objective will be to teach our NN to perform binary classification. That is, given input data X, we want it to predict the target Y (made up of 0s and 1s). We will build a model with L hidden layers, using ReLU and sigmoid as activation functions.

The ReLU and sigmoid functions can be defined in Julia with a vectorized implementation as:
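The following is a minimal sketch of these two activations, relying on Julia's broadcasting; the names sigmoid and relu are illustrative and may differ from the original repository.

```julia
# Element-wise (vectorized) activation functions.
sigmoid(z) = 1 ./ (1 .+ exp.(-z))   # logistic function applied element-wise
relu(z) = max.(0, z)                # rectified linear unit applied element-wise
```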

Given this problem, we will explain how to implement the forward and backwards step so that eventually we will have a fully functional neural network.

Parameter Initialization:

First, we must initialize the parameters of our NN. The weights must be initialized randomly in order to facilitate symmetry-breaking: if all weights started out identical, every neuron in a layer would compute the same output and receive the same gradient, so the neurons would never learn different features. Remember that we are ultimately trying to optimize a function, and starting from random weights gives us a better chance of reaching the lowest point.
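A possible initialization helper is sketched below; layer_dims holds the number of units in each layer, and the function name and dictionary keys are assumptions of this sketch rather than the exact code of the repository.

```julia
# Hypothetical sketch: small random weights break symmetry, biases start at zero.
# layer_dims = [n_x, n_h1, ..., n_y] gives the number of units in each layer.
function initialize_parameters(layer_dims)
    parameters = Dict{String, Array{Float64}}()
    for l in 2:length(layer_dims)
        parameters["W_$(l-1)"] = 0.01 .* randn(layer_dims[l], layer_dims[l-1])
        parameters["b_$(l-1)"] = zeros(layer_dims[l], 1)
    end
    return parameters
end
```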

Forward Step:

The next step is computing the forward propagation step for our L-layer neural network. We will use auxiliary functions in order to better understand the different processes that compose this step. The linear part of the process can be computed as:
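A sketch of this linear step might look as follows, with the cache mentioned below kept as a simple tuple (names are illustrative):

```julia
# Linear part of forward propagation: Z = W·A_prev + b, with the inputs cached
# for the backward step.
function linear_forward(A_prev, W, b)
    Z = W * A_prev .+ b
    cache = (A_prev, W, b)
    return Z, cache
end
```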

Notice that the information from every step is stored in a “cache” variable, which will allow us to speed up the backward propagation step.

After computing the linear step, we apply either the ReLU or the sigmoid function to the output:
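One way to sketch this, reusing the linear_forward, relu and sigmoid helpers from above and selecting the activation with a string flag (an assumption of this sketch):

```julia
# Linear step followed by the chosen activation; Z is cached for the
# activation derivative in the backward step.
function linear_activation_forward(A_prev, W, b, activation)
    Z, linear_cache = linear_forward(A_prev, W, b)
    A = activation == "relu" ? relu(Z) : sigmoid(Z)
    return A, (linear_cache, Z)
end
```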

In order to obtain the final outcome from our model, it is necessary to iterate over every layer, feeding the output from each layer as an input for the next one.
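A possible full forward pass over the L layers, using ReLU in the hidden layers and sigmoid in the output layer as described later; names and dictionary keys follow the earlier sketches.

```julia
# Iterate over the layers, feeding the output of each layer into the next one.
function forward_propagation(X, parameters)
    caches = Any[]
    A = X
    L = length(parameters) ÷ 2            # every layer contributes a W and a b
    for l in 1:L-1
        A, cache = linear_activation_forward(A, parameters["W_$l"], parameters["b_$l"], "relu")
        push!(caches, cache)
    end
    Y_hat, cache = linear_activation_forward(A, parameters["W_$L"], parameters["b_$L"], "sigmoid")
    push!(caches, cache)
    return Y_hat, caches
end
```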

After obtaining an output from our model, we can compute a cost function, which will determine the optimization problem that we are trying to solve. For this example, Binary Cross-Entropy will be used, defined as:
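J = −(1/m) · Σ_i [ y_i · log(y_hat_i) + (1 − y_i) · log(1 − y_hat_i) ]

A possible vectorized implementation in Julia is sketched below; the small epsilon term for numerical stability is an addition of this sketch.

```julia
# Binary cross-entropy averaged over the m training examples (columns of Y).
function compute_cost(Y_hat, Y)
    m = size(Y, 2)
    epsilon = 1e-8   # guards against log(0); an assumption of this sketch
    return -sum(Y .* log.(Y_hat .+ epsilon) .+ (1 .- Y) .* log.(1 .- Y_hat .+ epsilon)) / m
end
```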

Our learning algorithm will be driven by the minimization of the cost function. The parameters will be tuned such that J is minimized.

Backpropagation Step:

Backpropagation is arguably the most confusing part of training a neural network, so we will break it down into several functions in order to gain a deeper understanding of what it is actually doing.

Recall that for each layer the linear part is calculated as:
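Z[l] = W[l] · A[l−1] + b[l]

where A[l−1] is the activation output of the previous layer (with A[0] = X).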

Given the derivative with respect to Z (which will be referred to as dZ), we want to compute the derivatives with respect to the parameters and the activation part: dW, db and dA:

Backpropagation Formulas
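dW[l] = (1/m) · dZ[l] · (A[l−1])ᵀ
db[l] = (1/m) · Σ_i dZ[l]_i      (summing dZ[l] over the m training examples)
dA[l−1] = (W[l])ᵀ · dZ[l]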

The reader well-versed in calculus can try to derive these formulas from scratch by applying the chain rule. We can implement a vectorized version in Julia using the caches from the forward step:
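A sketch of the linear part of backpropagation, mirroring linear_forward and assuming the (A_prev, W, b) cache layout used above:

```julia
# Gradients of the linear step, using the cache stored during the forward pass;
# m is the number of training examples.
function linear_backward(dZ, linear_cache)
    A_prev, W, b = linear_cache
    m = size(A_prev, 2)
    dW = (dZ * A_prev') ./ m
    db = sum(dZ, dims=2) ./ m
    dA_prev = W' * dZ
    return dA_prev, dW, db
end
```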

The derivative with respect to Z can be computed with the following formula:
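dZ[l] = dA[l] ⊙ g′(Z[l])

where ⊙ denotes the element-wise product and g′ is the derivative of the activation function.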

This requires computing the derivative of the activation function. We will build two functions to compute vectorized versions of the ReLU and sigmoid derivatives:
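The following sketch evaluates both derivatives element-wise at the cached Z; the names mirror the forward sketches and are illustrative.

```julia
# Element-wise derivatives of the activations, applied to the incoming gradient dA.
relu_backward(dA, Z) = dA .* (Z .> 0)           # ReLU'(z) = 1 if z > 0, else 0

function sigmoid_backward(dA, Z)
    s = sigmoid(Z)
    return dA .* s .* (1 .- s)                  # sigmoid'(z) = s(z) * (1 - s(z))
end
```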

Integrating linear and activation derivatives in a single function:
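A possible combined function, assuming the (linear_cache, Z) layout produced by linear_activation_forward above:

```julia
# Chain the activation derivative with the linear gradients.
function linear_activation_backward(dA, cache, activation)
    linear_cache, Z = cache
    dZ = activation == "relu" ? relu_backward(dA, Z) : sigmoid_backward(dA, Z)
    return linear_backward(dZ, linear_cache)
end
```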

The last step needed is to calculate the derivative of the cost function with respect to the output of the last layer. In this case, it is the derivative of Binary Cross-Entropy with respect to the sigmoid function:
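dA[L] = −( Y / Y_hat − (1 − Y) / (1 − Y_hat) )      (element-wise)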

In every iteration, we must compute the gradients of each parameter in every layer, starting from the output layer. The gradients are stored in a dictionary which will be used to update the parameters.
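An illustrative full backward pass, starting from the output layer (sigmoid) and moving back through the hidden layers (ReLU); the dictionary keys follow the earlier sketches.

```julia
function backward_propagation(Y_hat, Y, caches)
    grads = Dict{String, Array{Float64}}()
    L = length(caches)
    dA = -(Y ./ Y_hat .- (1 .- Y) ./ (1 .- Y_hat))   # derivative of the cost w.r.t. the output
    dA, grads["dW_$L"], grads["db_$L"] = linear_activation_backward(dA, caches[L], "sigmoid")
    for l in (L-1):-1:1
        dA, grads["dW_$l"], grads["db_$l"] = linear_activation_backward(dA, caches[l], "relu")
    end
    return grads
end
```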

Updating the parameters is done by means of the gradient descent algorithm:
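W[l] := W[l] − alpha · dW[l]
b[l] := b[l] − alpha · db[l]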

Where alpha is the learning rate, a hyperparameter that must be selected after experimenting with different values.
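A possible implementation of this update in Julia, reusing the parameter and gradient dictionaries from the earlier sketches (names are illustrative):

```julia
# One gradient descent step over all L layers; alpha is the learning rate.
function update_parameters!(parameters, grads, alpha)
    L = length(parameters) ÷ 2
    for l in 1:L
        parameters["W_$l"] = parameters["W_$l"] .- alpha .* grads["dW_$l"]
        parameters["b_$l"] = parameters["b_$l"] .- alpha .* grads["db_$l"]
    end
    return parameters
end
```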

Summary:

Gathering the different steps, we can write a train_nn function which performs the forward and backward steps together for a given number of iterations. The number of iterations is another hyperparameter that must be selected empirically.
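A hypothetical train_nn tying the previous sketches together; the function name follows the article, but the signature and internals are assumptions of this sketch.

```julia
function train_nn(X, Y, layer_dims, alpha, num_iterations)
    parameters = initialize_parameters(layer_dims)
    costs = Float64[]
    for i in 1:num_iterations
        Y_hat, caches = forward_propagation(X, parameters)
        push!(costs, compute_cost(Y_hat, Y))     # track the cost at every iteration
        grads = backward_propagation(Y_hat, Y, caches)
        update_parameters!(parameters, grads, alpha)
    end
    return parameters, costs
end
```

With examples stored as columns of X and Y, a call might look like parameters, costs = train_nn(X, Y, [size(X, 1), 5, 3, 1], 0.01, 2500), where the layer sizes, learning rate and iteration count are purely illustrative.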

Tackling our binary classification problem, we can plot the cost function and the accuracy versus the number of iterations to ensure that our learning algorithm is working properly:
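For instance, assuming the costs vector returned by the train_nn sketch above and the Plots.jl package, the cost curve could be inspected with something like:

```julia
using Plots   # assumes the Plots.jl package is installed

plot(costs, xlabel = "iteration", ylabel = "cost", label = "training cost")
```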

The code can be found at https://github.com/XabierGA/DNN_Julia. In the next article, we will use this model to perform image classification.
