RNNs
A recurrent neural network (RNN) works on sequential data (e.g. time series data, bounding boxes in videos, machine translation). An RNN maintains a recurrent hidden state $h_t$ that is updated at every time step: $h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$, where $f$ is a non-linearity (e.g. $\tanh$) and $x_t$ is the input at time step $t$. An output can be read out at each step, e.g. $y_t = W_{hy} h_t$.
1. Training
The loss function is the sum of the per-time-step losses, $L = \sum_{t=1}^{T} L_t(y_t, \hat{y}_t)$.
Gradients are computed by unrolling the recurrence over time; this is called backpropagation through time (BPTT). By expansion of the chain rule,
$$\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}, \qquad \frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}(f') \, W_{hh}.$$
Depending on the non-linearity $f$ and the magnitude of $W_{hh}$, the product term can either vanish or explode, which is called the vanishing/exploding gradient problem, especially when $t - k$ is large.
Also, this means the dependency of $h_t$ on $x_k$ gets harder to learn as $t - k$ increases, which is called the long-term dependency problem.
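The vanishing behavior can be seen directly in a toy model. Below is a minimal sketch, assuming a scalar RNN with hypothetical fixed weights `w_hh` and `w_xh`: each BPTT factor is $w_{hh} \cdot \tanh'(\cdot) = w_{hh}(1 - h_i^2)$, and the product of 50 such factors collapses toward zero.

```python
import math

# Scalar RNN sketch (hypothetical weights) illustrating BPTT:
# h_t = tanh(w_hh * h_prev + w_xh * x_t). The gradient dL_t/dh_k contains
# the product of factors dh_i/dh_{i-1} = w_hh * (1 - h_i**2), which
# shrinks toward 0 when |w_hh * tanh'| < 1 (vanishing gradient).

def rnn_forward(xs, w_hh=0.5, w_xh=1.0):
    """Run a scalar RNN and return the hidden state at each step."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)
        hs.append(h)
    return hs

def bptt_factor_product(hs, w_hh=0.5):
    """Product of dh_i/dh_{i-1} = w_hh * (1 - h_i**2) over the sequence."""
    prod = 1.0
    for h in hs:
        prod *= w_hh * (1.0 - h * h)
    return prod

hs = rnn_forward([0.1] * 50)
grad = bptt_factor_product(hs)
# Each factor is at most 0.5, so the product decays at least as fast
# as 0.5**50 and effectively vanishes.
```

With `w_hh` larger than 1 and a non-saturating regime the same product can instead blow up, which is the exploding half of the problem.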
2. LSTMs
A long short-term memory (LSTM) network performs better than a vanilla RNN in learning long-term dependencies. Each cell now has a cell state $c_t$ in addition to the hidden state $h_t$, updated through several gates:
- Forget gate: Determines what information to discard from the cell state. $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, where $\sigma$ is the sigmoid function, and $[\cdot, \cdot]$ denotes concatenation. It depends on the previous hidden state $h_{t-1}$ and the current input $x_t$.
- Input gate: Determines what new information to add to the cell state. $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$.
- Candidate cell state update gate: Computes a candidate update to the cell state: $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$.
- Cell state update: The new state is $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, where $\odot$ denotes element-wise multiplication. Here, $f_t$ and $i_t$ control how much of the previous cell state to keep and how much of the candidate update to add.
- Hidden state update: The new hidden state is $h_t = o_t \odot \tanh(c_t)$,
where the output gate $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ controls how much of the cell state to output. Prediction of the output $y_t$ is then made from the hidden state, e.g. $y_t = \mathrm{softmax}(W_y h_t + b_y)$.
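The five updates above can be sketched as one cell step. This is a minimal scalar version with hypothetical scalar weights (a real cell uses weight matrices over the concatenation $[h_{t-1}, x_t]$ and vector-valued states):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One scalar LSTM cell step; W holds hypothetical scalar weights."""
    f = sigmoid(W["f_h"] * h_prev + W["f_x"] * x + W["f_b"])        # forget gate
    i = sigmoid(W["i_h"] * h_prev + W["i_x"] * x + W["i_b"])        # input gate
    c_tilde = math.tanh(W["c_h"] * h_prev + W["c_x"] * x + W["c_b"])  # candidate
    c = f * c_prev + i * c_tilde        # cell state update (keep + add)
    o = sigmoid(W["o_h"] * h_prev + W["o_x"] * x + W["o_b"])        # output gate
    h = o * math.tanh(c)                # hidden state update
    return h, c

# Toy run with all weights set to 0.5 (purely illustrative).
W = {k: 0.5 for k in
     ("f_h", "f_x", "f_b", "i_h", "i_x", "i_b",
      "c_h", "c_x", "c_b", "o_h", "o_x", "o_b")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, W)
```

Note that `h` is always in $(-1, 1)$ because it is a sigmoid-scaled $\tanh$, while the cell state `c` is unbounded; that unbounded additive path is what carries information over long spans.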
2.1 BPTT in LSTMs
Now, we require computing $\frac{\partial c_t}{\partial c_{t-1}}$. Since $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, this derivative contains the additive term $f_t$: the gradient flows along the cell state without being repeatedly multiplied by a weight matrix, so it is largely preserved whenever the forget gate stays close to 1. This is why LSTMs mitigate the vanishing gradient problem.
2.2 GRUs
Gated Recurrent Unit (GRU) is a simplified version of the LSTM that combines the forget and input gates into a single update gate. GRUs are faster to compute. There is no conclusive evidence that LSTMs or GRUs perform better than the other in practice, so start with LSTMs and switch to GRUs if you want a faster model or if the LSTM overfits.
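A minimal scalar GRU step, with hypothetical scalar weights, shows the simplification: the update gate $z_t$ interpolates directly between the old hidden state and a candidate, so there is no separate cell state at all.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, W):
    """One scalar GRU cell step; W holds hypothetical scalar weights."""
    z = sigmoid(W["z_h"] * h_prev + W["z_x"] * x)                 # update gate
    r = sigmoid(W["r_h"] * h_prev + W["r_x"] * x)                 # reset gate
    h_tilde = math.tanh(W["h_h"] * (r * h_prev) + W["h_x"] * x)   # candidate
    # z plays the combined role of the LSTM's forget and input gates:
    # keep (1 - z) of the old state, add z of the candidate.
    return (1.0 - z) * h_prev + z * h_tilde

W = {k: 0.5 for k in ("z_h", "z_x", "r_h", "r_x", "h_h", "h_x")}
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, W)
```

Counting the gates explains the speed claim: three gated transformations per step instead of the LSTM's four.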
2.3 Stacking LSTMs
We can stack multiple LSTM layers to increase the model capacity. The output of the $\ell$-th layer (its hidden state sequence) is fed as the input to the $(\ell+1)$-th layer.
Having some layers run in the reverse direction (i.e. processing the sequence in reverse order) can help capture dependencies from both directions; pairing a forward layer with a reverse layer is called a bidirectional LSTM. However, a layer stacked on top of a bidirectional layer must wait for that layer to finish processing the entire sequence, since its input at every time step depends on both the forward and backward passes below. This is called a stacked bidirectional LSTM.
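A bidirectional layer can be sketched as two independent passes over the sequence whose per-step outputs are paired. This sketch reuses a simple scalar tanh-RNN with hypothetical weights in place of a full LSTM:

```python
import math

def rnn_pass(xs, w_hh=0.5, w_xh=1.0):
    """One directional pass of a scalar tanh-RNN (hypothetical weights)."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)
        hs.append(h)
    return hs

def bidirectional(xs):
    """Run forward and reversed passes; pair their outputs per time step."""
    fwd = rnn_pass(xs)
    bwd = list(reversed(rnn_pass(list(reversed(xs)))))
    # Each time step now carries context from both directions; a stacked
    # layer would consume these pairs, so it cannot start until both
    # passes over the whole sequence are done.
    return list(zip(fwd, bwd))

out = bidirectional([0.1, 0.2, 0.3])
```

The `zip` at the end stands in for the concatenation of forward and backward hidden states that real bidirectional layers feed upward.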
3. Applications
3.1 Sequence to Sequence Models
Machine translation takes an input sentence $x_{1:S}$ in the source language and produces an output sentence $y_{1:T}$ in the target language.
A sequence encoder works by:
- Mapping each input token $x_s$ to an embedding vector $e_s$, and adding an ending token.
- Using an LSTM / stacked LSTM to process each embedding vector $e_s$ sequentially, and taking the final hidden state as the sequence encoding $c$.
- The last LSTM hidden state is used to produce the probability distribution of the first output token $y_1$.
- This can then be passed through more LSTM steps to produce the probability distribution of $y_2$, and so on.
We need to pass in the previously generated token $y_{t-1}$ as the decoder input when producing $y_t$.
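The decode loop above can be sketched as greedy decoding. Here `next_token_probs` is a hypothetical stand-in for one decoder LSTM step (a toy lookup table instead of a trained network); the loop's structure, i.e. feeding each prediction back in and stopping at the ending token, is the part being illustrated:

```python
VOCAB = ["<eos>", "a", "b"]

def next_token_probs(encoding, prev_token):
    # Toy rule standing in for the decoder LSTM step: given the sequence
    # encoding and the previous token, return a vocabulary distribution.
    table = {"<sos>": [0.0, 0.9, 0.1],
             "a":     [0.1, 0.0, 0.9],
             "b":     [0.9, 0.05, 0.05]}
    return table[prev_token]

def greedy_decode(encoding, max_len=10):
    out, prev = [], "<sos>"
    for _ in range(max_len):
        probs = next_token_probs(encoding, prev)
        tok = VOCAB[max(range(len(VOCAB)), key=probs.__getitem__)]
        if tok == "<eos>":   # stop at the ending token
            break
        out.append(tok)
        prev = tok           # feed the prediction back as the next input
    return out

decoded = greedy_decode(encoding=None)
# With this toy table, the decoder emits "a", then "b", then stops.
```

Greedy argmax is only the simplest choice; beam search keeps several candidate prefixes at each step instead of one.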
3.2 Image Captioning
The goal here is to generate a caption $y_{1:T}$ for an image $x$, by:
- Using a CNN encoder to extract a feature vector $v$ from the image $x$.
- Using an LSTM decoder to generate the caption $y_{1:T}$ from the feature vector $v$.
3.3 Sequence Generation Models
What if we want to generate a sequence without any input? We define a latent variable $z$ from which the sequence is generated.
To do this, we can build an autoregressive model (meaning the output at time step $t$ depends on the outputs at previous time steps), where each generated output is fed back as the input at the next step.