RNNs

A recurrent neural network (RNN) works on sequential data (e.g. time series data, bounding boxes in videos, machine translation). An RNN maintains a recurrent state $h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$, where $h_{t-1}$ is the previous state, $x_t$ is the current input, and $f$ is an activation function (commonly $\tanh$). The output of the RNN is then $y_t = W_{hy} h_t + b_y$.
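The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the dimensions, $\tanh$ non-linearity, and random weights are assumptions for the example.

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: update the recurrent state, emit an output."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x + b_h)  # new recurrent state h_t
    y = W_hy @ h + b_y                           # output y_t (e.g. logits)
    return h, y

# Toy dimensions (hypothetical): hidden size 4, input size 3, output size 2.
rng = np.random.default_rng(0)
W_hh, W_xh = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
W_hy = rng.normal(size=(2, 4))
b_h, b_y = np.zeros(4), np.zeros(2)

h = np.zeros(4)                      # initial state h_0
for x in rng.normal(size=(5, 3)):    # a length-5 input sequence
    h, y = rnn_step(h, x, W_hh, W_xh, W_hy, b_h, b_y)
```

The same weights are applied at every time step; only the state $h_t$ carries information forward.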

1. Training

The loss function is $\mathcal{L}(\theta) = \sum_{t=1}^{T} \mathcal{L}_t(y_t, \hat{y}_t)$, where $\theta$ are the parameters of the RNN. To compute the loss gradient $\frac{\partial \mathcal{L}}{\partial \theta}$, we will need:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial h_t} \frac{\partial h_t}{\partial \theta}$$

This is called backpropagation through time (BPTT). By expansion of $\frac{\partial h_t}{\partial h_k}$, we get:

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

Depending on the non-linearity of $f$, the product $\prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$ can either vanish or explode, which is called the vanishing/exploding gradient problem, especially when $t - k$ is large.

Also, this means the dependency of $h_t$ on $x_k$ gets harder to learn as $t - k$ increases, which is called the long-term dependency problem.
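A quick numerical illustration of the vanishing case: for a $\tanh$ RNN, each factor is $\frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}(1 - h_i^2)\, W_{hh}$, so multiplying 50 of them with small recurrent weights drives the product toward zero (large weights would instead risk explosion). The sizes and the 0.3 weight scale here are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
h_dim = 8
W = 0.3 * rng.normal(size=(h_dim, h_dim)) / np.sqrt(h_dim)  # small recurrent weights

h = np.zeros(h_dim)
J = np.eye(h_dim)  # accumulates prod_i d h_i / d h_{i-1}
for x in rng.normal(size=(50, h_dim)):
    h = np.tanh(W @ h + x)
    J = (np.diag(1.0 - h**2) @ W) @ J  # Jacobian of one tanh step, chained

grad_norm = np.linalg.norm(J)  # shrinks roughly geometrically with sequence length
```

After 50 steps `grad_norm` is numerically negligible, so the gradient signal from $h_{50}$ back to $h_0$ is essentially lost.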

2. LSTMs

A long short-term memory (LSTM) network performs better than an RNN at learning long-term dependencies. Each cell now has a cell state $c_t$ in addition to the recurrent (hidden) state $h_t$. At each time step, the LSTM computes:

  1. Forget gate: Determines what information to discard from the cell state. $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, where $\sigma$ is the sigmoid function, and $[h_{t-1}, x_t]$ denotes concatenation. It depends on the previous hidden state $h_{t-1}$ and the current input $x_t$.
  2. Input gate: Determines what new information to add to the cell state. $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$.
  3. Candidate cell state: Computes a candidate update to the cell state: $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$.
  4. Cell state update: The new cell state is $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, where $\odot$ denotes element-wise multiplication. Here, $f_t$ and $i_t$ control how much of the previous cell state to keep and how much of the candidate update to add.
  5. Hidden state update: The new hidden state is $h_t = o_t \odot \tanh(c_t)$, where the output gate $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ controls how much of the cell state to output.

Prediction of $y_t$ is the same as in RNNs: $y_t = W_{hy} h_t + b_y$.
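The five steps above can be sketched as one NumPy function. For compactness this sketch (an illustration, with assumed toy sizes) stacks all four gate weight matrices into a single `W` acting on the concatenated $[h_{t-1}, x_t]$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W, b):
    """One LSTM step. W maps [h_prev, x] to the four stacked gate
    pre-activations, in the order (forget, input, candidate, output)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])             # forget gate f_t
    i = sigmoid(z[H:2*H])           # input gate i_t
    c_tilde = np.tanh(z[2*H:3*H])   # candidate cell state
    o = sigmoid(z[3*H:4*H])         # output gate o_t
    c = f * c_prev + i * c_tilde    # cell state update (element-wise)
    h = o * np.tanh(c)              # hidden state update
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3  # hypothetical hidden and input sizes
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(h, c, x, W, b)
```

Note that the cell state $c_t$ is updated additively, while the hidden state $h_t$ is a gated, squashed view of it.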

2.1 BPTT in LSTMs

Now, we require computing $\frac{\partial c_t}{\partial c_{t-1}}$. From the cell state update, $\frac{\partial c_t}{\partial c_{t-1}} = f_t$ (plus terms flowing through the gates). Since $f_t$ is the output of a sigmoid, it lies in the range $(0, 1)$, which can help prevent exploding gradients. Also, because the cell state update is additive rather than a repeated matrix multiplication, this term can stay close to $1$, which can help prevent vanishing gradients. Therefore, LSTMs can better capture long-term dependencies than RNNs.

2.2 GRUs

A Gated Recurrent Unit (GRU) is a simplified version of the LSTM that combines the forget and input gates into a single update gate. GRUs are faster to compute. There is no conclusive evidence that either LSTM or GRU performs better in practice, so start with LSTMs and switch to GRUs if you want a faster model, or if the LSTM has overfitting issues.
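A GRU step can be sketched as follows. This is an illustration with assumed toy sizes, following the common convention where the update gate $z_t$ interpolates between the old state and a candidate (some references swap the roles of $z_t$ and $1 - z_t$).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, Wz, Wr, Wh, bz, br, bh):
    """One GRU step: a single update gate z replaces the LSTM's forget/input pair."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx + bz)   # update gate
    r = sigmoid(Wr @ hx + br)   # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # interpolate; no separate cell state

rng = np.random.default_rng(0)
H, D = 4, 3  # hypothetical hidden and input sizes
Wz, Wr, Wh = (rng.normal(size=(H, H + D)) for _ in range(3))
h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(h, x, Wz, Wr, Wh, np.zeros(H), np.zeros(H), np.zeros(H))
```

Two gates and one state instead of three gates and two states is where the speedup comes from.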

2.3 Stacking LSTMs

We can stack multiple LSTM layers to increase the model capacity; this is called a stacked LSTM. The output of the $\ell$-th layer at time step $t$ is fed as input to the $(\ell+1)$-th layer at the same time step $t$. This allows the model to learn more complex representations of the sequential data. Note that layer $\ell+1$ does not need to wait for layer $\ell$ to finish its entire forward pass: as soon as layer $\ell$ has processed time step $t$, layer $\ell+1$ can process step $t$ while layer $\ell$ moves on to step $t+1$.

Having some layers run in the reverse direction (i.e. processing the sequence in reverse order) can help capture dependencies from both directions; this is called a bidirectional LSTM. However, when stacking, the next layer now needs the backward states, which are only available after the previous layer has finished its entire pass over the sequence, so layers can no longer be pipelined across time steps. This is called a stacked bidirectional LSTM.
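The stacking and direction-reversal ideas can be sketched with a simplified $\tanh$ cell standing in for a full LSTM (an illustrative assumption, along with all sizes):

```python
import numpy as np

def run_layer(xs, W, reverse=False):
    """Run a tanh recurrent layer over a whole sequence; optionally reversed."""
    H = W.shape[0]
    h = np.zeros(H)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    out = [None] * len(xs)
    for t in order:                 # backward layers see the future first
        h = np.tanh(W @ np.concatenate([h, xs[t]]))
        out[t] = h
    return out

rng = np.random.default_rng(0)
xs = list(rng.normal(size=(5, 3)))  # a length-5 sequence of 3-dim inputs
H = 4
fwd = run_layer(xs, rng.normal(size=(H, H + 3)))
bwd = run_layer(xs, rng.normal(size=(H, H + 3)), reverse=True)

# Bidirectional output per step: concatenate forward and backward states,
# then feed the result into the next stacked layer.
layer2_in = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
layer2 = run_layer(layer2_in, rng.normal(size=(H, H + 2 * H)))
```

The concatenation step is exactly why the second layer must wait: `bwd[0]` only exists once the backward pass has consumed the whole sequence.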

3. Applications

3.1 Sequence to Sequence Models

Machine translation takes an input $x_{1:S}$ in the source language and outputs $y_{1:T}$ in the target language: i.e. it learns a conditional distribution $p(y_{1:T} \mid x_{1:S})$. To do this, we can define an autoregressive model:

$$p(y_{1:T} \mid x_{1:S}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1}, x_{1:S})$$

A sequence encoder summarizes the information in the input sequence. This is done by:

  1. Map each input token $x_s$ to an embedding vector $e_s$, and append an end-of-sequence token.
  2. Use an LSTM / stacked LSTM to process the embedding vectors sequentially, and take the final hidden state as the sequence encoding $c$.
  3. The last LSTM state is used to produce the probability distribution of $y_1$.
  4. This can then be passed through more LSTM steps to produce the probability distribution of $y_2$, and so on.

We need to pass $y_{t-1}$ as input to the model at step $t$. In training, $y_{t-1}$ is the ground-truth output from the supervision sequence; this is called teacher forcing. In testing, $y_{t-1}$ is the output generated by the model at the previous time step.
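The two decoding regimes differ only in where the previous token comes from. A minimal sketch, with hypothetical sizes, a start token of id 0, and random untrained weights:

```python
import numpy as np

def decoder_step(h, y_prev_embed, W, V):
    """One decoder step: update the state from the previous token, emit logits."""
    h = np.tanh(W @ np.concatenate([h, y_prev_embed]))
    return h, V @ h  # logits over the vocabulary

rng = np.random.default_rng(0)
H, E, VOCAB = 4, 3, 6                 # hidden, embedding, vocab sizes (assumed)
W, V = rng.normal(size=(H, H + E)), rng.normal(size=(VOCAB, H))
embed = rng.normal(size=(VOCAB, E))   # hypothetical token embeddings
target = [2, 5, 1]                    # ground-truth token ids

# Training (teacher forcing): feed the ground-truth previous token.
h, prev = np.zeros(H), 0              # start from the start token
for gold in target:
    h, logits = decoder_step(h, embed[prev], W, V)
    prev = gold                       # next input is the ground truth

# Testing: feed back the model's own greedy prediction.
h, prev, generated = np.zeros(H), 0, []
for _ in range(3):
    h, logits = decoder_step(h, embed[prev], W, V)
    prev = int(np.argmax(logits))     # next input is the model's own output
    generated.append(prev)
```

Teacher forcing keeps training stable, but at test time early mistakes are fed back in, which is one source of train/test mismatch.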

3.2 Image Captioning

The goal here is to generate a caption $y_{1:T}$ from an image $x$. This works by:

  1. Using a CNN encoder to extract a feature vector $c$ from the image $x$.
  2. Using an LSTM decoder to generate the caption $y_{1:T}$ from the feature vector $c$.

3.3 Sequence Generation Models

What if we want to generate a sequence without any input? We define a latent variable $z \sim p(z)$ (e.g. a standard Gaussian), and a conditional distribution $p_\theta(x_{1:T} \mid z)$ parameterized by a neural network. Then we can use this model to generate a sequence by first sampling $z$ and then sampling $x_{1:T} \sim p_\theta(x_{1:T} \mid z)$.

To do this, we can build an autoregressive model (meaning the output at time step $t$ depends on the previous outputs) using the generator:

$$p_\theta(x_{1:T} \mid z) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}, z)$$
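Generation under this model can be sketched as: sample $z$ once, then at each step condition on $z$ and the previously sampled token. A simple $\tanh$ cell stands in for the LSTM, and all sizes, the start token, and the softmax output are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, E, VOCAB = 4, 2, 3, 6           # hypothetical sizes
W = rng.normal(size=(H, H + Z + E))   # recurrent weights; every step sees z
V = rng.normal(size=(VOCAB, H))       # maps hidden state to token logits
embed = rng.normal(size=(VOCAB, E))   # token embeddings

z = rng.normal(size=Z)                # sample the latent variable z ~ N(0, I)
h, prev, seq = np.zeros(H), 0, []     # token id 0 as an assumed start token
for t in range(5):
    h = np.tanh(W @ np.concatenate([h, z, embed[prev]]))
    logits = V @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    prev = int(rng.choice(VOCAB, p=probs))    # sample x_t | x_{<t}, z
    seq.append(prev)
```

Different draws of $z$ steer the whole sequence, while the feedback of `prev` makes each token depend on the ones before it.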
