Generative Models
1. Generative Models
Supervised learning learns a function $f: \mathcal{X} \to \mathcal{Y}$ from labelled pairs $(x, y)$. Unsupervised learning instead works with unlabelled data $x \sim p_{\text{data}}(x)$, and includes the following tasks:
- Probability density estimation: learn a distribution $p_\theta(x) \approx p_{\text{data}}(x)$ to generate new samples. Latent Variable Models (LVMs) can describe sampling of an observation as a two-step process: first sample a latent variable $z$ from a prior distribution $p(z)$, then sample $x$ from a conditional distribution $p_\theta(x \mid z)$. Then $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, where $z$ is a latent variable.
- Dimensionality reduction: learn a low-dimensional representation of (often sparse) high-dimensional data. Principal Component Analysis (PCA) reduces dimensionality by projecting onto principal component directions that capture the most variance. Probabilistic PCA models data as Gaussian with a low-rank covariance structure. Autoencoders learn an encoder function $f_\phi: \mathcal{X} \to \mathcal{Z}$ that maps input to a low-dimensional latent space, and a decoder function $g_\theta: \mathcal{Z} \to \mathcal{X}$ that reconstructs the input from the latent representation. The model is trained to minimize reconstruction error.
- Clustering: discovers group structure in unlabelled data points. A Gaussian mixture model (GMM) models data as a mixture of Gaussian distributions, where each component corresponds to a cluster. The model is trained using the expectation-maximization (EM) algorithm to find the parameters that maximize the likelihood of the data.
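As a concrete illustration of the dimensionality-reduction task, here is a minimal PCA sketch via the SVD. The function name and the toy data are illustrative, not part of the notes:

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                 # PCA assumes centred data
    # Right singular vectors of the centred data are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                    # (n_samples, k) codes

rng = np.random.default_rng(0)
# Toy data: 3-D points that lie close to a 2-D plane.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))
Z = pca(X, 2)
```

Because the data is centred before projection, the codes `Z` have zero mean, and the top-2 directions capture almost all of the variance of this near-planar data.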
2. VAEs
2.1 Divergence Minimization
Divergence minimization is the process of finding a distribution $p_\theta$ that is as close as possible to the data distribution, $\theta^* = \arg\min_\theta D(p_{\text{data}}, p_\theta)$, for some divergence measure $D$.
2.2 KL Divergence
The Kullback-Leibler (KL) divergence is defined as
$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right].$$
To use the KL divergence, we can simplify the expression as
$$\arg\min_\theta D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = \arg\max_\theta \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big],$$
since the entropy term $\mathbb{E}_{p_{\text{data}}}[\log p_{\text{data}}(x)]$ does not depend on $\theta$. In practice, this means computing the Monte Carlo estimate
$$\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i),$$
where $x_i \sim p_{\text{data}}$. Hence, minimizing the KL divergence to the data distribution is equivalent to maximum likelihood estimation (MLE).
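A quick numeric sanity check (example values assumed, not from the notes): the Monte Carlo estimate of $D_{\mathrm{KL}}(p \,\|\, q)$ for two 1-D Gaussians converges to the closed-form value.

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KL divergence between N(m1, s1^2) and N(m2, s2^2).
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

rng = np.random.default_rng(0)
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
x = rng.normal(m1, s1, size=200_000)         # samples from p
log_p = -0.5 * ((x - m1) / s1)**2 - np.log(s1 * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - m2) / s2)**2 - np.log(s2 * np.sqrt(2 * np.pi))
kl_mc = np.mean(log_p - log_q)               # E_p[log p(x) - log q(x)]
kl_exact = kl_gauss(m1, s1, m2, s2)
```

With 200k samples the Monte Carlo estimate agrees with the analytic value to a few decimal places.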
2.3 Fitting a LVM
To fit a Latent Variable Model (LVM) to a data distribution with MLE, we maximise
$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$
(see intro to LVMs). The MLE objective defined above is intractable, so we can optimise a variational lower bound instead:
$$\log p_\theta(x) = \log \int q_\phi(z \mid x)\, \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\, dz \qquad \text{(introduce a variational distribution } q_\phi(z \mid x)\text{)}$$
$$\geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \qquad \text{(apply Jensen's inequality)}$$
$$= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \qquad \text{(rearrange terms)}.$$
The goal of a VAE is to find
$$\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(\theta, \phi; x_i), \qquad \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$
However, the expectation in $\mathcal{L}$ is taken under $q_\phi(z \mid x)$, which itself depends on the parameters $\phi$ we want to optimise. We can differentiate this to obtain a Monte Carlo gradient w.r.t. $\theta$ (the sampling distribution does not depend on $\theta$):
$$\nabla_\theta \mathcal{L} \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(x \mid z_k), \qquad z_k \sim q_\phi(z \mid x),$$
but the same estimator is not valid for $\phi$. Hence, by letting
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
(the reparameterization trick), the randomness is moved into $\epsilon$, which does not depend on $\phi$. Now, we can differentiate this w.r.t. $\phi$:
$$\nabla_\phi \mathcal{L} \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\phi \log p_\theta\big(x \mid \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_k\big) - \nabla_\phi D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$
2.4 Designing the Distribution
A common choice is a factorized Gaussian distribution, i.e.,
$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \mathrm{diag}(\sigma_\phi^2(x))\big).$$
Using this, we can express an analytic form of the KL regularizer: with prior $p(z) = \mathcal{N}(0, I)$,
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{K} \big( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \big),$$
where $\mu_j, \sigma_j$ are the components of $\mu_\phi(x), \sigma_\phi(x)$ and $K$ is the latent dimension.
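A numeric check of this closed form (example values assumed): the analytic Gaussian KL matches a Monte Carlo estimate built from the definition.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1) over latent dimensions
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

mu = np.array([0.5, -1.0])
sigma = np.array([1.0, 0.5])
kl = kl_to_standard_normal(mu, sigma)

# Monte Carlo estimate of E_q[log q(z) - log p(z)] for comparison.
rng = np.random.default_rng(1)
eps = rng.normal(size=(400_000, 2))
z = mu + sigma * eps
log_q = np.sum(-0.5 * ((z - mu) / sigma)**2 - np.log(sigma)
               - 0.5 * np.log(2 * np.pi), axis=1)
log_p = np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)
```

Note the KL is always non-negative and is zero only when $\mu = 0$ and $\sigma = 1$, which is what makes it a regularizer pulling the posterior toward the prior.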
2.5 Variational Autoencoders
Combining the above, we want to find
$$\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \Big( \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\log p_\theta(x_i \mid z_i)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x_i) \,\|\, p(z)\big) \Big), \qquad z_i = \mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon.$$
Where:
- $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ is the reparameterization of $q_\phi(z \mid x)$.
- $p_\theta(x \mid z)$ is the decoder function that maps the latent variable $z$ to the data space, parameterized by $\theta$.
- The encoder $q_\phi(z \mid x)$ and decoder together form the stochastic autoencoder that reconstructs the input $x$ from the latent representation $z$.
- $\mathbb{E}\big[\log p_\theta(x \mid z)\big]$ is the reconstruction loss that measures how well the model can reconstruct the input data from the latent representation.
- $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ is the KL regularization term that encourages the learned latent distribution to be close to the prior distribution, and prevents $\sigma_\phi(x)$ from collapsing to zero.
- $\epsilon \sim \mathcal{N}(0, I)$ is a noise variable that allows for stochastic sampling of the latent variable during training.
Once trained, we can generate new sample images from the model, by sampling $z \sim p(z)$ and decoding $x = g_\theta(z)$ (or sampling $x \sim p_\theta(x \mid z)$).
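A minimal numeric sketch of the reparameterization trick used above (values illustrative): sampling $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ gives $z \sim \mathcal{N}(\mu, \sigma^2)$, while keeping $z$ a deterministic, differentiable function of $(\mu, \sigma)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
eps = rng.normal(size=1_000_000)   # randomness independent of (mu, sigma)
z = mu + sigma * eps               # reparameterized sample
```

Empirically, `z` has mean ~2.0 and standard deviation ~0.5, confirming the change of variables.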
2.6 Training
To compute the objective over a large dataset efficiently, we train with stochastic gradient ascent on mini-batches. Initialize $\theta$ and $\phi$ randomly, then repeat until convergence:
- Sample a mini-batch of size $M$ as $\{x_i\}_{i=1}^{M}$.
- Compute $\mu_\phi(x_i)$ and $\sigma_\phi(x_i)$ for each $i$. These are the parameters of the variational distribution for each data point in the mini-batch.
- Apply the reparameterization trick as $z_i = \mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, I)$ for each $i$. This allows us to sample from the variational distribution in a way that is differentiable with respect to $\phi$.
- Find $\hat{x}_i = g_\theta(z_i)$ for each $i$. This is the reconstruction of the input data point from the latent representation $z_i$.
- Finally, we can update the network parameters. First, compute the variational lower bound as
$$\mathcal{L} = \sum_{i=1}^{M} \Big( \log p_\theta(x_i \mid z_i) - D_{\mathrm{KL}}\big(q_\phi(z \mid x_i) \,\|\, p(z)\big) \Big).$$
Then, update the network parameters as $\theta \leftarrow \theta + \eta \nabla_\theta \mathcal{L}$ and $\phi \leftarrow \phi + \eta \nabla_\phi \mathcal{L}$.
In the variational lower bound, the KL term $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ has an analytic form if both $q_\phi(z \mid x)$ and the prior $p(z)$ are Gaussian. If there is no analytic form, we can use Monte Carlo estimation to approximate it.
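The mini-batch ELBO computation can be sketched forward-only in numpy, under illustrative assumptions: a linear encoder/decoder with hypothetical weights, a unit-variance Gaussian decoder, and $\sigma_\phi(x) = 1$. No backprop is shown; a real implementation would differentiate `elbo` w.r.t. the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, M = 4, 2, 8                       # data dim, latent dim, batch size
W_enc = rng.normal(size=(K, D)) * 0.1   # hypothetical encoder weights (for mu)
W_dec = rng.normal(size=(D, K)) * 0.1   # hypothetical decoder weights

x = rng.normal(size=(M, D))             # mini-batch
mu = x @ W_enc.T                        # mu_phi(x)
log_sigma = np.zeros((M, K))            # sigma_phi(x) = 1 for simplicity
sigma = np.exp(log_sigma)

eps = rng.normal(size=(M, K))
z = mu + sigma * eps                    # reparameterization trick
x_hat = z @ W_dec.T                     # decoder mean g_theta(z)

# log p_theta(x|z) for a unit-variance Gaussian decoder
recon = -0.5 * np.sum((x - x_hat)**2 + np.log(2 * np.pi), axis=1)
# Analytic KL of a factorized Gaussian posterior against N(0, I)
kl = 0.5 * np.sum(mu**2 + sigma**2 - 2 * log_sigma - 1.0, axis=1)
elbo = np.mean(recon - kl)
```

The per-example KL is always non-negative, so the ELBO never exceeds the (average) reconstruction term.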
3. GANs
Our goal is once again to fit a probabilistic model $p_\theta(x)$ to the data distribution $p_{\text{data}}(x)$.
A generative adversarial network (GAN) once again samples a latent variable $z \sim p(z)$, and transforms it through a generator network $x = G_\theta(z)$, defining an implicit model distribution $p_\theta(x)$. A discriminator network $D_\phi(x) \in (0, 1)$ outputs the probability that $x$ is a real sample rather than a generated one.
Our objective function is now a minimax on
$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\phi(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big].$$
Discriminator vs Generator:
- With fixed $\theta$, training $\phi$ is a supervised learning problem, where we maximize the negative cross-entropy loss between the discriminator's predictions and the labels (real vs. fake).
- With fixed $\phi$, training $\theta$ resembles a reinforcement learning problem, where we minimize the log probability of the discriminator being correct on fake samples, i.e., we want to minimize $\mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big]$.
When fixing $\theta$, the optimal discriminator is
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}.$$
Substituting $D^*$ back into the objective gives $2\, D_{\mathrm{JS}}\big(p_{\text{data}} \,\|\, p_\theta\big) - \log 4$. This is a valid divergence measure, and is known as the Jensen-Shannon (JS) divergence. Hence, minimizing the objective over $\theta$ is equivalent to minimizing the JS divergence between $p_{\text{data}}$ and $p_\theta$.
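A numeric check of this identity (the two densities are illustrative, not from the notes): evaluating the objective at the optimal discriminator on a grid recovers $2\,\mathrm{JSD} - \log 4$.

```python
import numpy as np

# Grid over the real line; p = "data" density, q = "model" density.
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)          # N(0, 1)
q = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)    # N(1, 1)

d_star = p / (p + q)                                  # optimal discriminator
value = np.sum(p * np.log(d_star) * dx) + np.sum(q * np.log(1 - d_star) * dx)

m = 0.5 * (p + q)                                     # mixture density
jsd = 0.5 * np.sum(p * np.log(p / m) * dx) + 0.5 * np.sum(q * np.log(q / m) * dx)
```

The agreement is exact up to floating-point error, since the identity holds pointwise on the grid.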
3.1 Training
Training a GAN uses a double loop algorithm:
- Inner Loop: with fixed $\theta$, optimise $\phi$ for a few gradient ascent iterations: $\phi \leftarrow \phi + \eta_\phi \nabla_\phi \big( \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))] \big)$.
- Outer Loop: with fixed $\phi$, optimise $\theta$ for JUST ONE gradient descent iteration: $\theta \leftarrow \theta - \eta_\theta \nabla_\theta\, \mathbb{E}_{z \sim p(z)}\big[\log(1 - D_\phi(G_\theta(z)))\big]$.
These two steps are repeated until convergence. The number of inner loop iterations and learning rates are hyperparameters that can be tuned for better performance. In practice, the expectations are estimated with minibatches:
$$\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\phi(x)\big] \approx \frac{1}{M} \sum_{i=1}^{M} \log D_\phi(x_i), \qquad \text{where } x_i \sim p_{\text{data}},$$
$$\mathbb{E}_{z \sim p(z)}\big[\log(1 - D_\phi(G_\theta(z)))\big] \approx \frac{1}{M} \sum_{i=1}^{M} \log\big(1 - D_\phi(G_\theta(z_i))\big), \qquad \text{where } z_i \sim p(z).$$
The full algorithm is:
- Initialize $\theta$, $\phi$; learning rates $\eta_\theta$, $\eta_\phi$, and numbers of inner and outer loop iterations $K$ and $T$.
- For $t = 1, \dots, T$:
  - For $k = 1, \dots, K$:
    - Sample minibatches $\{x_i\}_{i=1}^{M} \sim p_{\text{data}}$ and $\{z_i\}_{i=1}^{M} \sim p(z)$.
    - Update $\phi \leftarrow \phi + \eta_\phi \nabla_\phi \frac{1}{M} \sum_{i=1}^{M} \big[\log D_\phi(x_i) + \log\big(1 - D_\phi(G_\theta(z_i))\big)\big]$.
  - Sample minibatch $\{z_i\}_{i=1}^{M} \sim p(z)$.
  - Update $\theta \leftarrow \theta - \eta_\theta \nabla_\theta \frac{1}{M} \sum_{i=1}^{M} \log\big(1 - D_\phi(G_\theta(z_i))\big)$.
3.2 Non-Saturating Loss
We want to maximise the probability of the discriminator making the wrong decision on fake data. So, instead of minimising $\mathbb{E}_{z \sim p(z)}\big[\log(1 - D_\phi(G_\theta(z)))\big]$, we maximise $\mathbb{E}_{z \sim p(z)}\big[\log D_\phi(G_\theta(z))\big]$. Early in training, when the discriminator easily rejects fake samples ($D_\phi(G_\theta(z)) \approx 0$), the original loss saturates and its gradient vanishes, whereas the non-saturating loss still provides a strong learning signal.
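The saturation effect can be seen directly from the derivatives w.r.t. the discriminator logit $s$ (with $D = \sigma(s)$): $\frac{d}{ds}\log(1 - D) = -D$ while $\frac{d}{ds}\log D = 1 - D$. A small numeric illustration (the logit value is assumed for the example):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Early in training the discriminator easily spots fakes:
# D(G(z)) is near 0, i.e. the logit s is very negative.
s = -6.0
d = sigmoid(s)                     # D(G(z)), roughly 0.0025 here

grad_saturating = -d               # d/ds log(1 - D): vanishes as D -> 0
grad_non_saturating = 1.0 - d      # d/ds log D: stays near 1
```

When the generator is doing badly, the saturating loss gives almost no gradient, while the non-saturating loss gives a near-maximal one, which is exactly the regime where the generator most needs a signal.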
3.3 GAN Implementations
- DCGAN replaces pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator). Uses batchnorm. Removes fully connected layers for deeper architectures. Uses ReLU activations in the generator and LeakyReLU in the discriminator.
- LAPGAN starts generation at a low resolution, then generates successively higher-resolution images conditioned on the (upsampled) lower-resolution image, following the levels of a Laplacian pyramid. Since multiple discriminators are used, we can train each generator-discriminator pair independently, which is more stable than training a single generator-discriminator pair to generate high resolution images.
- Progressive GAN starts with low resolution training images, then progressively increases the resolution by adding new layers to the generator and discriminator. This allows the model to learn coarse features first, then fine details, which can lead to better convergence and higher quality images.
- StyleGAN introduces a new generator architecture where the latent variable $z$ is mapped (by a learned mapping network) into a style latent space $\mathcal{W}$, which controls generation at every resolution. Fine details are generated with noise injected at different scales.
4. Conditional LVMs
How do we construct conditional latent variable models? I.e., specify a conditional distribution $p_\theta(x \mid y)$, where $y$ is the conditioning information (e.g., a class label).
Now, the goal is to learn the generative model
$$p_\theta(x \mid y) = \int p_\theta(x \mid z, y)\, p(z)\, dz.$$
Now, the encoder must take both $x$ and $y$ as input, giving a variational distribution $q_\phi(z \mid x, y)$; the decoder likewise conditions on $y$.
4.1 Conditional GANs
Now, we want to minimax:
$$\min_\theta \max_\phi \; \mathbb{E}_{(x, y) \sim p_{\text{data}}}\big[\log D_\phi(x, y)\big] + \mathbb{E}_{z \sim p(z),\, y \sim p(y)}\big[\log\big(1 - D_\phi(G_\theta(z, y), y)\big)\big].$$
5. Diffusion Models
We can make the latent distribution more expressive by using a sequence of latent variables $z_T, \dots, z_1$ rather than a single one:
- Starting still from a Gaussian prior $p(z_T) = \mathcal{N}(0, I)$.
- Latent variables are transformed step by step: $z_{t-1} \sim p_\theta(z_{t-1} \mid z_t)$ for $t = T, \dots, 1$.
ELBO (Evidence Lower Bound) learning requires designing an inference distribution over the chain of latents, $q(z_{1:T} \mid x)$:
- Bottom-up approach: infer the latents from the data upwards, $q(z_1 \mid x)$, then $q(z_{t+1} \mid z_t)$ for each subsequent level. This is empirically unstable, behaving inconsistently across datasets and architectures.
- Top-down approach: infer the latents in the same top-down order as the generative model, conditioning each $q(z_{t-1} \mid z_t, x)$ on the level above. This is more stable, and is the basis of diffusion models.
5.1 Fixed Forward Diffusion Process
In diffusion models, instead of learning the inference distribution, we fix it to a simple Gaussian diffusion process:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big),$$
which can equivalently be written as:
$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1},$$
where $\epsilon_{t-1} \sim \mathcal{N}(0, I)$ and $\beta_1, \dots, \beta_T$ is a fixed variance schedule. Defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, sampling at an arbitrary step can be written in closed form as:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big), \qquad \text{i.e.} \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.$$
As $t \to T$, $\bar{\alpha}_t \to 0$ and $q(x_T \mid x_0)$ approaches $\mathcal{N}(0, I)$, which can be interpreted as progressively smoothing the data distribution: starting from the complex data distribution at $t = 0$ and ending at a simple Gaussian at $t = T$.
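The closed-form forward process can be sketched directly in numpy; the linear variance schedule below is an assumed example, not the only choice:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)        # assumed linear variance schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)            # \bar{alpha}_t

# Closed-form sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
x0 = rng.normal(loc=3.0, scale=0.1, size=100_000)   # toy 1-D "data"
t = 500
eps = rng.normal(size=x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
```

`alpha_bar` decreases monotonically toward zero, so the marginal mean of `xt` shrinks from the data mean toward 0 and its variance approaches 1, consistent with $x_T$ becoming standard Gaussian.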
5.2 Reverse Denoising Process
After defining a fixed forward diffusion process $q(x_t \mid x_{t-1})$, we introduce a learned reverse model $p_\theta(x_{t-1} \mid x_t)$
that learns to denoise $x_t$ one step at a time. Treating $x_{1:T}$ as latent variables,
we can derive the ELBO objective for training the diffusion model. The resulting variational lower bound becomes:
$$\mathcal{L} = \mathbb{E}_q\Big[\log p_\theta(x_0 \mid x_1) - \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)\Big].$$
This objective encourages the learned reverse transition $p_\theta(x_{t-1} \mid x_t)$ to match the true reverse posterior $q(x_{t-1} \mid x_t, x_0)$ at every step.
To make learning tractable, we design the reverse model to be Gaussian:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big).$$
Since the forward process is Gaussian, the true reverse posterior is also Gaussian,
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\big), \qquad \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t.$$
Thus the model learns to estimate the original clean data: parameterize $\mu_\theta(x_t, t) = \tilde{\mu}_t\big(x_t, \hat{x}_\theta(x_t, t)\big)$, where $\hat{x}_\theta(x_t, t)$ is a neural network's estimate of $x_0$.
Using this parameterization, the KL divergence term in the ELBO simplifies significantly. In particular, the KL term between the true reverse posterior and the learned reverse model becomes proportional to a mean squared error objective. Specifically,
$$D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) = \frac{1}{2\sigma_t^2} \big\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \big\|^2 + \text{const} \;\propto\; \big\| x_0 - \hat{x}_\theta(x_t, t) \big\|^2 + \text{const}.$$
This shows that training the diffusion model can be interpreted as predicting the original clean data $x_0$ from its noisy version $x_t$, at every noise level $t$.
5.3 Predicting Noise
A common and more effective parameterization is to have the neural network predict the noise $\epsilon$ that was added to $x_0$, rather than $x_0$ itself. Rearranging the closed-form forward sample $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ gives
$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon\big),$$
where $\epsilon \sim \mathcal{N}(0, I)$ is the noise used in the forward process.
Thus, instead of predicting $x_0$ directly, the network outputs $\epsilon_\theta(x_t, t)$, from which $\hat{x}_0$ is recovered.
With this parameterization, the training objective simplifies to predicting the true noise that produced $x_t$:
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t}\Big[ \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 \Big].$$
This formulation makes diffusion models particularly stable to train: the network simply learns to identify and remove Gaussian noise at different noise levels, enabling the model to iteratively denoise samples starting from pure noise $x_T \sim \mathcal{N}(0, I)$.
5.4 Training
To train the diffusion model, we can repeat the following until convergence:
- Sample $x_0$ from the data distribution.
- Let $t \sim \mathrm{Uniform}(\{1, \dots, T\})$ and $\epsilon \sim \mathcal{N}(0, I)$.
- Take a gradient descent step on $\nabla_\theta \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\big) \big\|^2$.
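One iteration of this loop, sketched forward-only in numpy. Here `eps_model` is a hypothetical zero-output stand-in for the trained noise predictor (a real model would be, e.g., a UNet taking $(x_t, t)$), and no gradient step is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)       # assumed linear variance schedule
alpha_bar = np.cumprod(1.0 - beta)

def eps_model(xt, t):
    # Placeholder predictor; a trained network would estimate eps from (x_t, t).
    return np.zeros_like(xt)

x0 = rng.normal(size=(16, 32))                      # mini-batch of toy "images"
t = rng.integers(0, T)                              # t ~ Uniform({1, ..., T})
eps = rng.normal(size=x0.shape)                     # eps ~ N(0, I)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
loss = np.mean((eps - eps_model(xt, t))**2)         # ||eps - eps_theta(x_t, t)||^2
```

With the zero stand-in predictor, the loss is simply the mean squared norm of the sampled noise, i.e. close to 1.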
5.5 Sampling
To sample from the trained model, do the following:
- Take $x_T \sim \mathcal{N}(0, I)$ as input.
- For $t = T, \dots, 1$:
  - Sample $z \sim \mathcal{N}(0, I)$ if $t > 1$, else set $z = 0$.
  - Let $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\Big) + \sigma_t z$.
- Return $x_0$ as the generated sample.
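The sampling loop can be sketched in numpy. `eps_model` is again a hypothetical stand-in for the trained noise predictor, so the "samples" here are not meaningful images; the point is the control flow of the update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
beta = np.linspace(1e-4, 0.02, T)       # assumed variance schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)
sigma = np.sqrt(beta)                    # one common choice of sigma_t

def eps_model(xt, t):
    return np.zeros_like(xt)             # stand-in for eps_theta(x_t, t)

x = rng.normal(size=(4, 8))              # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):
    z = rng.normal(size=x.shape) if t > 0 else np.zeros_like(x)
    # x_{t-1} = (x_t - (1 - alpha_t)/sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t) + sigma_t z
    x = (x - (1 - alpha[t]) / np.sqrt(1 - alpha_bar[t]) * eps_model(x, t)) \
        / np.sqrt(alpha[t]) + sigma[t] * z
```

Note that noise is injected at every step except the last, matching the $z = 0$ case at $t = 1$ in the algorithm above.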
Choosing Hyperparameters
We must choose hyperparameters such as the noise schedule $\beta_1, \dots, \beta_T$ and the total number of diffusion steps $T$, as well as the number of steps used for the generation process.
5.6 Architecture Design
Since our output dimension is the same as the input, we need to use an architecture whose output resolution matches its input; the standard choice is a UNet, a convolutional encoder-decoder with skip connections.
Vision Transformers (ViTs) are also used: the image is split into patches, each embedded as a token, and the tokens are arranged into a sequence. This allows the model to capture long-range dependencies and global context in the image, which is important for generating high-quality samples.
Both UNet and ViT are expensive on high resolution images. Instead, we can lift the diffusion process to a lower dimensional latent space. We can train a VAE to learn a latent representation of the data, then apply the diffusion process in this latent space. This is the basis of latent diffusion models, which can generate high-quality images with significantly reduced computational cost compared to pixel-space diffusion models.
5.7 LDMs vs VAEs
At generation time, the sampling distributions used by VAEs and Latent Diffusion Models (LDMs) differ. In a standard VAE, we sample the latent variable directly from the prior $p(z) = \mathcal{N}(0, I)$; in an LDM, latents are produced by running the learned reverse diffusion process in the latent space.
This difference matters because of how the ELBO objective is optimized during VAE training. The ELBO is:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$
During training, the decoder $p_\theta(x \mid z)$ only ever receives latents sampled from the variational posteriors $q_\phi(z \mid x)$, i.e., from the aggregated posterior over the training data.
However, at generation time in a VAE we still sample $z \sim p(z)$. If the aggregated posterior does not match the prior, the decoder is evaluated on latent codes unlike any it saw during training, which degrades sample quality.
Latent diffusion models mitigate this issue by learning a generative process that better matches the aggregated posterior distribution, reducing the mismatch between the latent codes used during training and those used during generation.