Recurrent Neural Networks in Deep Learning — Part 2

Priyal Walpita · Published in DataDrivenInvestor · Apr 1, 2020

By Priyal Walpita

Reading this article will help you understand Artificial Neural Networks (ANNs), their drawbacks, the architecture of Recurrent Neural Networks (RNNs), the advantages of using RNNs over ANNs, how RNNs work, and how to build sequence models for various use cases. I have intentionally kept this article focused on theory and its interpretation, primarily around Recurrent Neural Networks (RNNs).

This blog post consists of two sections, and this is the second one. The first part walked you through an introduction to RNNs and the theory behind them. This section discusses the types of RNN and a few practical usages.

Note: This article is based on Dr. Andrew Ng's lectures on Coursera.

Different Types Of Recurrent Neural Network (RNN)

In short, there are various forms of RNN. They are as follows:

  • One to One RNN
  • One to Many RNN
  • Many to One RNN
  • Many to Many RNN

We have already established that Recurrent Neural Networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while maintaining hidden states. They are primarily used in natural language processing and speech recognition. Let's look at the various types:

One to One RNN

One to One RNN (Tx = Ty = 1) is the most basic and traditional form of neural network, giving a single output for a single input, as you can see in the picture above.

One to Many

One to Many (Tx = 1, Ty > 1) is a kind of RNN architecture used in situations where multiple outputs are produced for a single input. Music generation is a reference example of its application: RNN models are used in music generation to produce a piece of music (multiple outputs) from a single musical note (single input).

Many to One

The many-to-one architecture of the RNN (Tx > 1, Ty = 1) is commonly illustrated with sentiment analysis models. As the name implies, this type of model is used when multiple inputs are needed to produce a single output.

For example, consider a model for analyzing Twitter sentiment: a text entry (words as multiple inputs) is mapped to its sentiment (a single output). Another example is a film rating model that uses review text as input to rate a film on a scale from 1 to 5.
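To make the many-to-one flow concrete, below is a minimal NumPy sketch of the forward pass: the word vectors are consumed one time step at a time, and only the final hidden state is mapped to a single output. The sizes, the tanh activation, and the sigmoid sentiment head are illustrative assumptions, not something specified in this article.

```python
import numpy as np

# Minimal many-to-one RNN forward pass: Tx inputs, a single output at the end.
n_x, n_a = 8, 16                      # word-vector size and hidden-state size (assumed)
Wax = np.random.randn(n_a, n_x) * 0.01
Waa = np.random.randn(n_a, n_a) * 0.01
Wya = np.random.randn(1, n_a) * 0.01
ba, by = np.zeros((n_a, 1)), np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def many_to_one_forward(xs):
    """xs: list of Tx input vectors, each of shape (n_x, 1)."""
    a = np.zeros((n_a, 1))            # initial hidden state a<0>
    for x in xs:                      # read the whole input sequence, left to right
        a = np.tanh(Wax @ x + Waa @ a + ba)
    return sigmoid(Wya @ a + by)      # single output, e.g. a sentiment score in [0, 1]

# A "sentence" of 5 random word vectors mapped to one sentiment-like score
sentence = [np.random.randn(n_x, 1) for _ in range(5)]
print(many_to_one_forward(sentence))
```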

Many to Many

Many-to-many RNN architectures (Tx > 1, Ty > 1) take multiple inputs and produce multiple outputs, but many-to-many models can be of two types, as shown above:

  1. Tx = Ty:

This applies to the case where the input and output sequences are of the same length. A typical example is Named Entity Recognition, where each input word receives a corresponding output label.

  2. Tx != Ty:

The many-to-many architecture can also appear in models where the input and output sequences are of different lengths, and Machine Translation is the most common application of this type of RNN architecture. Thanks to this unequal-length many-to-many architecture, machine translation models can return more or fewer words than the input sentence contains.

The following section explains the usage of different types of RNN.

Usages of RNN

Language Modeling And Sequence Generation

What Is Language Modeling?

Language modeling is one of the basic tasks of natural language processing (NLP). In this section you will learn how to build a language model using an RNN.

Speech recognition is one of the most common use cases for a language model. Assume you are trying to build a speech recognition program, and it comes across an utterance where one word sounds ambiguous, like this:

Ex : “The apple and pair salad.”

“The apple and pear salad.”

The speech recognition system therefore cannot tell whether we said the word "pair" or "pear." As a person, I know that "pear" is the word that makes sense here, but the problem is how to train the machine to make that choice accurately.

Now, a successful speech recognition system uses a language model to pick the right sentence (the second one in the example) by measuring the likelihood of both sentences and choosing the one that is more likely to occur.
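As a rough sketch of that comparison: the language model scores a sentence as the product of per-word conditional probabilities (in practice, the sum of their logs), and the system keeps the sentence with the higher score. The probability values below are made-up placeholders, used only to show the mechanics.

```python
import numpy as np

# P(sentence) = P(w1) * P(w2 | w1) * ... * P(wT | w1..wT-1)
# Placeholder per-word probabilities for the two candidate transcriptions.
p_pair_salad = [0.21, 0.35, 0.09, 0.0001, 0.4]   # "The apple and pair salad"
p_pear_salad = [0.21, 0.35, 0.09, 0.02, 0.4]     # "The apple and pear salad"

def sentence_log_prob(word_probs):
    # Summing log probabilities avoids numerical underflow on long sentences.
    return float(np.sum(np.log(word_probs)))

print(sentence_log_prob(p_pair_salad))   # much lower score
print(sentence_log_prob(p_pear_salad))   # higher score, so "pear" wins
```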

So now the problem is how to create a language model of this sort using an RNN. These are the specific steps to follow:

  1. The first thing we need is a training set: a large corpus of text in the given language (the examples here are in English).
  2. For each sentence in your training set, first tokenize it and represent each word as a one-hot encoded vector (using its index in a dictionary / vocabulary). A vocabulary may, for example, be the top 10,000 words in English (see the tokenization sketch after this list).
  3. Add an extra token, < EOS >, to mark where your sentence ends.
  4. Decide whether you want punctuation to be tokenized or discarded, and handle it accordingly.
  5. If a word (token) in your training data does not appear in the vocabulary you are working with, represent it with the special token < UNK >, for unknown.
  6. Then use an RNN to model the probabilities of the different sequences.
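Here is a toy tokenization sketch along the lines of steps 2 to 5, with a tiny vocabulary standing in for the real top-10,000-word dictionary; the sentence and the indices are purely illustrative.

```python
# Tiny stand-in vocabulary mapping words to indices, plus <EOS> and <UNK>.
vocab = {"<EOS>": 0, "<UNK>": 1, "cats": 2, "average": 3, "15": 4,
         "hours": 5, "of": 6, "sleep": 7, "a": 8, "day": 9}

def tokenize(sentence):
    tokens = sentence.lower().replace(".", "").split()      # punctuation discarded here
    tokens.append("<EOS>")                                   # mark the end of the sentence
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]    # out-of-vocabulary -> <UNK>

print(tokenize("Cats average 15 hours of sleep a day."))
# -> [2, 3, 4, 5, 6, 7, 8, 9, 0]
```

Each index can then be turned into a one-hot vector of vocabulary size before it is fed to the RNN.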

Recurrent Neural Network Architecture Of Speech Recognition

If you look carefully, at the first time step the RNN block uses a Softmax function to try to predict the likelihood of every word in the dictionary. At the next step it tries to predict the second word, "average", given the first word, "cats" (a conditional probability), and this continues until it reaches the end-of-sentence token.

So, the RNN learns to predict one word at a time, moving from left to right.

Now, to train this neural network, we define a cost function: at a given time step "t", if the real word is "y" and the predicted word is "yhat", the loss measures how far "yhat" is from "y", and the overall cost is the sum of these losses over all the time steps.
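A common choice for this cost, and the one that pairs naturally with a softmax output layer, is the cross-entropy loss at each time step, summed over the whole sentence. The small NumPy sketch below assumes one-hot targets and is meant as an illustration rather than an exact reproduction of the course's notation.

```python
import numpy as np

def step_loss(y_hat_t, y_t):
    # y_t: one-hot true word, y_hat_t: softmax output over the vocabulary.
    return -np.sum(y_t * np.log(y_hat_t + 1e-12))   # small epsilon guards against log(0)

def sequence_cost(y_hats, ys):
    # Overall cost: sum of the per-time-step losses across the sentence.
    return sum(step_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))

vocab_size = 5
y_t = np.eye(vocab_size)[3]                     # true word is index 3
y_hat_t = np.array([0.1, 0.1, 0.1, 0.6, 0.1])   # model assigns it probability 0.6
print(step_loss(y_hat_t, y_t))                  # -log(0.6) ≈ 0.51
```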

If you train this RNN on a really big training set, it will be able to predict the next word very well. Given the past words "Cats average 15 hours of," the RNN will, in all likelihood, predict "sleep." At that point you have trained a strong language model. Therefore, when you get a new sentence, the same logic and math can predict the correct words in it.

For the example we began with earlier, the correct sentence can be predicted as:

“The apple and pear salad”.

After you have trained a sequence model, you can sample from its distribution to see what it has actually learned.

Sample Out New Sequences From The Trained Language Model

Once you have a trained language model, give it a shot to see how well it has learned. One easy way to test it is to keep sampling from its distribution and see whether the output makes any sense.

The way you do that is to take the RNN that was trained, use the softmax output to estimate the likelihood of every word in the dictionary, and then randomly select one word (for example with numpy.random.choice). Then pass that word (one-hot encoded, say) as the input to the second time step and let the RNN predict the conditional likelihood of each word in the dictionary given the word from the previous time step. This continues until the < EOS > token is produced (sampled). If your vocabulary does not contain this token, you can sample roughly 20 words (typically a sentence) and stop once that number of time steps is reached.
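Below is a hedged NumPy sketch of that sampling loop. The rnn_step function and its random weights merely stand in for one forward step of a trained model; the vocabulary size, hidden-state size, and <EOS> index are assumptions made only to keep the snippet runnable.

```python
import numpy as np

vocab_size, n_a, eos_index = 10, 16, 0

# Random weights stand in for a trained model, purely to make the loop runnable.
Wax = np.random.randn(n_a, vocab_size) * 0.01
Waa = np.random.randn(n_a, n_a) * 0.01
Wya = np.random.randn(vocab_size, n_a) * 0.01

def rnn_step(x, a):
    a = np.tanh(Wax @ x + Waa @ a)
    z = Wya @ a
    probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax over the vocabulary
    return probs, a

def sample(max_len=20):
    a = np.zeros((n_a, 1))
    x = np.zeros((vocab_size, 1))              # first input is the all-zeros vector
    sampled = []
    for _ in range(max_len):                   # stop after ~20 words if <EOS> never appears
        probs, a = rnn_step(x, a)
        idx = np.random.choice(vocab_size, p=probs.ravel())
        if idx == eos_index:                   # stop once the <EOS> token is sampled
            break
        sampled.append(idx)
        x = np.zeros((vocab_size, 1))          # feed the sampled word back in, one-hot encoded
        x[idx] = 1.0
    return sampled

print(sample())
```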

Sometimes this procedure will generate the unknown token <UNK>. If you want to make sure it never appears in the output, you can reject that sample and resample from the rest of the vocabulary.

This is how your RNN language model produces a randomly selected sentence.

You can also build a character-level RNN, where your vocabulary consists of the letters of the alphabet and some special characters such as punctuation and the digits 0–9. The structure is otherwise similar.

One advantage of a character-level RNN is that you will never produce an < UNK > token. The biggest drawback, however, is that you have to deal with much longer sequences and dependencies over a longer range, and such models are more costly to train computationally. Still, if you have a very limited vocabulary, a character-level language model may be useful.

Issues With Recurrent Neural Networks (RNNs)

One of the problems with RNNs is that they run into the vanishing gradient problem. Let's see what that means. Consider these two sentences:

  • This restaurant which was opened by my aunt serves authentic Chinese food.
  • These restaurants which were opened by my good friends serve authentic Indian food.

Now, in the example above, you can see how the subject ("restaurant" / "restaurants") determines the form of the verb ("serves" / "serve"). This is known as subject-verb agreement in English, and it is one of the main principles of sentence construction. Such rules establish long-term dependencies between words (i.e. a word at the beginning of the sentence affects a word at the end of it).

These long-term dependencies are not captured by the basic RNN architecture.

This relates to the "vanishing gradient" problem: when a neural network is very deep (has many layers), the gradients (derivatives) computed at the output end (the last few layers) have much less impact on the first few layers when the error is propagated backwards. In a sense, the gradient vanishes as it travels back through a very deep network. As a result, a simple RNN cannot capture very long-term dependencies between words, and it fails to detect the subject-verb relationship when the subject and verb are far apart.

Hence, the basic RNN architecture captures only very local influences.

So, this "vanishing gradient" problem is dealt with using a special class of RNNs called Gated Recurrent Units, more commonly known as GRUs.

Gated Recurrent Unit (GRU)

The GRU is a modification of the RNN's hidden layer that allows it to capture long-range dependencies in sequences and helps with the vanishing gradient problem.

RNN Hidden Layer Unit

Now let's modify this design into a GRU so we can capture long-term dependencies. As we read the sentence from left to right, we add another variable "c", the "memory cell", to remember dependencies such as subject-verb agreement: in our case, whether "restaurant" is singular or plural and how that affects the verb connected with it.

The main part of the GRU is where the memory cell "c" is updated with a candidate value, based on an "update gate" that determines whether or not to overwrite the memory cell. In our example, "c" might be set to 1 or 0 depending on whether the subject is singular or plural; the GRU then carries this "Ct" value all the way to the verb ("serve" in our case) and uses it to determine the verb's form.

A few GRU architectures also use another gate, the "relevance gate", to better capture long-term dependencies (a refinement that evolved through research over time).
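To pin down the mechanics, here is a sketch of a single GRU step using the same names as above: memory cell c, an update gate, and a relevance gate. The weight shapes and random initialisation are illustrative assumptions.

```python
import numpy as np

n_c, n_x = 16, 8                                      # memory-cell and input sizes (assumed)
Wu = np.random.randn(n_c, n_c + n_x) * 0.01; bu = np.zeros((n_c, 1))
Wr = np.random.randn(n_c, n_c + n_x) * 0.01; br = np.zeros((n_c, 1))
Wc = np.random.randn(n_c, n_c + n_x) * 0.01; bc = np.zeros((n_c, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x):
    concat = np.vstack([c_prev, x])
    gamma_u = sigmoid(Wu @ concat + bu)               # update gate: overwrite the memory cell or not?
    gamma_r = sigmoid(Wr @ concat + br)               # relevance gate: how much of c<t-1> matters
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x]) + bc)   # candidate value
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # keep the old memory where gamma_u is ~0
    return c                                          # in the GRU, the activation a<t> equals c<t>

c = np.zeros((n_c, 1))
for x in [np.random.randn(n_x, 1) for _ in range(5)]:
    c = gru_step(c, x)
```

Because gamma_u can stay close to 0 for many time steps, the memory cell barely changes across them, which is exactly what lets the GRU carry the singular/plural information from the subject all the way to the verb.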

A related architecture that takes this idea of gates further is the LSTM, which we are going to look at now.

Long Short-Term Memory (LSTM)

It captures long-term dependencies in sequences just like the GRU, but it is more powerful; the GRU can actually be viewed as a simplified version of the LSTM.

Here "at" is not equal to "ct", i.e. the activation is no longer the same as the memory cell. We also use two separate gates, an "update" gate and a "forget" gate, rather than the one-minus-update gate we used in the GRU.
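Here is a corresponding sketch of one LSTM step. Besides the update and forget gates mentioned above, the standard LSTM also has an output gate that produces the activation from the memory cell; the sizes and random weights are again just placeholders.

```python
import numpy as np

n_a, n_x = 16, 8                                 # activation/memory and input sizes (assumed)
def rand(shape):
    return np.random.randn(*shape) * 0.01
Wu, Wf, Wo, Wc = (rand((n_a, n_a + n_x)) for _ in range(4))
bu = bf = bo = bc = np.zeros((n_a, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x):
    concat = np.vstack([a_prev, x])
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate (not 1 - update, as in the GRU)
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate memory value
    c = gamma_u * c_tilde + gamma_f * c_prev     # memory cell c<t>
    a = gamma_o * np.tanh(c)                     # activation a<t>, distinct from c<t>
    return a, c

a, c = np.zeros((n_a, 1)), np.zeros((n_a, 1))
a, c = lstm_step(a, c, np.random.randn(n_x, 1))
```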

GRUs are simpler than LSTMs and run faster computationally, which helps when building larger models, but LSTMs are more powerful because they are a more general version of the GRU, with more gates to control what to keep and what to forget.

Understanding both lets us choose the right one for the problem we want to solve.

Bidirectional Recurrent Neural Network (RNN)

If you remember from when we were discussing RNNs, the network reads the tokens from left to right. So it cannot capture dependencies like the ones below.

In our named entity recognition problem, we try to find out whether a word is a person's name or not:

He said, “Teddy bears are on sale !”.

He said, “Teddy Roosevelt was a president!”.

In the first sentence, the word "Teddy" refers to a toy, while in the second it is part of a name. Unless we have a network that looks at words in both directions (left to right and right to left), such dependencies will not be captured.

So we describe an architecture known as the "Bidirectional RNN", which allows you, at any point in time, to take information from both earlier and later in the sequence. The recurrent unit may be a basic RNN, a GRU, or an LSTM; the idea is to scan the sequence from both sides and then make a prediction (in our case, whether a word is a person's name or not).

So it is an acyclic graph. Given an input sequence, the network computes "a-forward-1", then uses it to compute "a-forward-2", and so on until the end; at the same time it computes "a-backward-4", "a-backward-3", and so on back to the beginning. After computing all of these hidden activations, you can then make the predictions Yhat-1, Yhat-2, and so on until the end.
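A compact NumPy sketch of that forward pass: one left-to-right scan, one right-to-left scan, and a prediction at each step that concatenates both activations. The plain tanh units and the sizes here are illustrative; in practice each unit could just as well be a GRU or LSTM cell.

```python
import numpy as np

n_a, n_x, n_y = 16, 8, 2                         # hidden, input, and output sizes (assumed)
Wf = np.random.randn(n_a, n_a + n_x) * 0.01      # forward-direction weights
Wb = np.random.randn(n_a, n_a + n_x) * 0.01      # backward-direction weights
Wy = np.random.randn(n_y, 2 * n_a) * 0.01        # output layer sees both directions

def step(W, a_prev, x):
    return np.tanh(W @ np.vstack([a_prev, x]))

def bidirectional_forward(xs):
    Tx = len(xs)
    a_fwd, a_bwd = [None] * Tx, [None] * Tx
    a = np.zeros((n_a, 1))
    for t in range(Tx):                          # left-to-right pass
        a = step(Wf, a, xs[t]); a_fwd[t] = a
    a = np.zeros((n_a, 1))
    for t in reversed(range(Tx)):                # right-to-left pass
        a = step(Wb, a, xs[t]); a_bwd[t] = a
    # Each y<t> uses both activations, so it sees words before AND after position t.
    return [Wy @ np.vstack([a_fwd[t], a_bwd[t]]) for t in range(Tx)]

outputs = bidirectional_forward([np.random.randn(n_x, 1) for _ in range(4)])
print(len(outputs))                              # one prediction per input position
```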

One downside of the bidirectional RNN is that you need the entire input sequence before you can make any predictions. So we cannot use it for real-time applications like speech recognition, because it would mean the person has to finish speaking before our algorithm starts recognizing the speech, which isn't realistic.

Wrapping These All Up Using A Deep Network

We've seen the basic building blocks of RNNs, GRUs, and LSTMs from the very start up to now. Now we can stack these blocks up and build a deep network, or Deep RNN, to learn very complex functions.
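As a quick illustration of stacking, here is a minimal Keras sketch of a deep RNN: every recurrent layer except the last returns its full output sequence so the layer above can consume it. The layer sizes, the choice of LSTM cells, and the sigmoid head are assumptions made for the example, not a prescription.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 50)),  # layer 1
    tf.keras.layers.LSTM(64, return_sequences=True),                          # layer 2
    tf.keras.layers.LSTM(64),                                                 # layer 3: last hidden state only
    tf.keras.layers.Dense(1, activation="sigmoid"),                           # e.g. a many-to-one sentiment head
])
model.summary()
```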

Thanks a lot for reading this article. If you have any questions, please ask them here or reach me via email (priyal@priyal.ai) or on my LinkedIn.


CTO @ ZorroSign | Seasoned Software Architect | Expertise in AI/ML , Blockchain , Distributed Systems and IoT | Lecturer | Speaker | Blogger