Generative Models

1. Generative Models

Supervised learning learns a function given a dataset of input-output pairs. Unsupervised learning learns patterns in data without explicit labels. Generative models generate new samples from a dataset. They can be used for:

Probability density estimation: learn a distribution to generate new samples. Latent Variable Models (LVMs) can describe sampling of observation as a two-step process: first sample a latent variable from a prior distribution , then sample from a conditional distribution . Then where is a latent variable.
Dimensionality reduction: learn a low-dimensional representation of (often sparse) high dimensional data. Principal Component Analysis (PCA) reduces dimensionality by projecting onto principal component directions that capture the most variance. Probablistic PCA models data as Gaussian with a low-rank covariance structure. Autoencoders learn an encoder function that maps input to a low-dimensional latent space, and a decoder function that reconstructs the input from the latent representation. The model is trained to minimize reconstruction error.
Clustering: discovers group structure in unlabelled data points. A Gaussian mixture model (GMM) models data as a mixture of Gaussian distributions, where each component corresponds to a cluster. The model is trained using the expectation-maximization (EM) algorithm to find the parameters that maximize the likelihood of the data.

2. VAEs

2.1 Divergence Minimization

Divergence minimization is the process of finding a distribution that is close to a target distribution by minimizing a divergence measure ():

2.2 KL Divergence

The Kullback Leibler (KL) divergence is defined as . The KL divergence is not symmetric, i.e., . We can verify easily that the above conditions hold using Jensen's inequality ().

To use KL divergence, we can simplify the expression as . The first term is constant with respect to , so minimizing KL divergence is equivalent to maximizing the expected log-likelihood of the data under the model, i.e., .

In practice, this means computing where .

2.3 Fitting a LVM

To fit a Latent Variable Model (LVM) to a data distribution with MLE.

The MLE function defined above is intractible, so we can optimise for a variational lower bound instead:

(see intro to LVMs).
(introduce a variational distribution ).
(apply Jensen's inequality).
(rearrange terms).

The goal of a VAE is to find , where - a VAE trains the generative model and the variational distribution simultaneously to maximize the variational lower bound on the log-likelihood of the data. In English, the VAE learns to encode data into a latent space and decode from that latent space back to the original data, while ensuring that the latent space follows a specified prior distribution.

However, is still intractable as it requires computing over every possible . We can use Monte Carlo estimation to approximate this expectation:

We can differentiate this to obtain MC gradient w.r.t. , allowing us to learn it. To learn w.r.t. , we can use a reparameterization trick: we can express as a deterministic function of and some noise that is independent of :

Hence, by letting , we can rewrite the expectation as:

Now, we can differentiate this w.r.t. to learn it as well.

2.4 Designing the Distribution

A common choice is a Factorized Gaussian distribution, i.e., : follows distribution with mean and diagonal covariance matrix . and are parameterized by a neural network with parameters . The variance is ensured to be non-negative.

Using this, we can express the an analytic form of the KL regularizer, with , and is the dot product between two vectors.:

2.5 Variational Autoencoders

Combining the above, we want to find , where:

Where:

is the reparameterization of .
is the decoder function that maps latent variable to the data space, parameterized by .
is the stochastic autoencoder that reconstructs the input from the latent representation .
is the reconstruction loss that measures how well the model can reconstruct the input data from the latent representation.
is the KL regularization term that encourages the learned latent distribution to be close to the prior distribution, and prevents from collapsing to zero.
is a noise variable that allows for stochastic sampling of the latent variable during training.

Once trained, we can generate new sample images from the model, by sampling and then computing . Different dimensions of encode different features of the generated image, and we can manipulate these features by changing the corresponding dimensions of .

2.6 Training

To compute :

Initialize and randomly. Also set learning rates , and total iterations for stochastic gradient descent. Then, for , we first (perform approximate posterior inference) then (reconstruct the data):

Sample mini-batch of size as .
Compute and for each . These are the parameters of the variational distribution for each data point in the mini-batch.
Apply the reparameterization trick as where for each . This allows us to sample from the variational distribution in a way that is differentiable with respect to .
Find for each . This is the reconstruction of the input data point from the latent representation .
Finally, we can update network parameters. First, compute variational lower bound as . Then, update network parameters as .

In the variational lower bound, has an analytic form if both and the prior are Gaussian. If there is no analytic form, we can use Monte Carlo estimation to approximate it.

3. GANs

Our goal is once again to fit probabilistic model to an underlying distribution s.t. . We want to minimize a divergence measure: .

A generative adversarial network (GAN) once again samples a latent variable from prior , then passes it through a generator (neural network) to produce sample . Now, a discriminator (neural network) takes in a sample and underlying distribution , to produce a binary value of whether is from the underlying distribution or the generator. The discriminator is trained to distinguish between real and generated samples, while the generator is trained to produce samples that can fool the discriminator.

Our objective function is now a minmax on and :

Discriminator vs Generator

With fixed , training is a supervised learning problem, where we maximize the negative cross entropy loss between the discriminator's predictions and the labels (real vs. fake).

With fixed , training is a reinforcement learning problem, where we minimize the negative log probability of the discriminator being correct, i.e., we want to maximize .

When fixing , the optimal discriminator is , which satisfies . Substituting this back into the objective function, we get:

This is a valid divergence measure, and is known as the Jensen-Shannon (JS) divergence. Hence, minimizing is equivalent to minimizing the JS divergence between and .

3.1 Training

Training a GAN uses a double loop algorithm:

Inner Loop: with fixed , optimise for a few gradient ascent iterations:
Outer Loop: with fixed , optimise for JUST ONE gradient descent iteration:

These two steps are repeated until convergence. The number of inner loop iterations and learning rates are hyperparameters that can be tuned for better performance. In practice, computing is intractable, so we can use Monte Carlo estimation to approximate it by sampling minibatches from the respective distributions:

where .
where .

The full algorithm is:

Initialize , ; learning rates , , and number of inner and outer loop iterations and .
For :
1. For :
  1. Sample minibatches and .
  2. .
2. Sample minibatch .
3. .
4. .

3.2 Non Saturate Loss

We want to maximise the probability of making the wrong decisions on fake data. So, instead of minimising , we can use non saturate loss:

3.3 GAN Implementations

DCGAN replaces pooling layers with strided convolution and fractional strided convolutions. Uses batchnorm. Removes fully connected layers for deeper architectures. Uses ReLU activation in generator and LeakyReLU in discriminator.
LAPGAN starts generation on lower resolution images (), then generate higher resolution images conditioned on the lower resolution image , for . Since multiple discriminators are used, we can train each generator-discriminator pair independently, which is more stable than training a single generator-discriminator pair to generate high resolution images.
Progressive GAN starts with low resolution training images, then progressively increases the resolution by adding new layers to the generator and discriminator. This allows the model to learn coarse features first, then fine details, which can lead to better convergence and higher quality images.
StyleGAN introduces a new generator architecture where latent variable is transformed into a style latent space , which controls generation at every resolution. Fine details generated with noise at different scales.

4. Conditional LVMs

How do we construct conditional latent variable models? I.e. specify a conditional distribution where is some observed variable. For example, generating an input with a specific label.

Now, the goal is to learn generative model where is the data to be geenrated, and is the label that the generation processi s conditioned on. We make the input of the network: . This is a conditional VAE, parameter efficient, works on continuous . To train this model on data , we can use the conditional variational lower bound, derived from MLE:

Now, the encoder must now take as input. The decoder must take as input.

4.1 Conditional GANs

Now, we want to minmax:

5. Diffusion Models

We can make the latent distribution in a VAE more flexible by introducing heirarchical latent variable models. Instead of using 1 latent variable , we use a heirarchy of latent variables :

Starting still from a gaussian prior .
Latent variables are transformed: for .

ELBO (Evidence Lower Bound) learning requires designing . We can do this:

Bottom up approach: , then . This is empirically unstable, behaving inconsistently across datasets and architectures.
Top down approach: . This is more stable, and is the basis of diffusion models.

5.1 Fixed Forward Diffusion Process

In diffusion models, instead of learning the inference distribution , we fix the forward diffusion process . This avoids instability during training caused by chasing a changing posterior in hierarchical VAEs. The process keeps the same dimensionality at every step, i.e. , and gradually adds Gaussian noise to the data. The forward transition is defined as:

which can equivalently be written as:

where is a variance schedule that determines how much noise is added at each timestep. Because this process is linear and Gaussian, the marginal distribution has a closed-form expression, allowing us to sample directly from without iterating through all intermediate steps. This marginal distribution is

and sampling can be written as:

As increases, decreases toward , meaning the signal from gradually disappears and approaches pure Gaussian noise. If we define , then the distribution at time becomes:

which can be interpreted as progressively smoothing the data distribution: starting from the complex data distribution at and gradually transforming it into an approximately standard Gaussian distribution at .

5.2 Reverse Denoising Process

After defining a fixed forward diffusion process , the goal is to learn the reverse process that gradually removes noise and reconstructs the data. Specifically, we design a parameterized model:

that learns to denoise step-by-step until we recover . Using a top-down decomposition of the variational posterior:

we can derive the ELBO objective for training the diffusion model. The resulting variational lower bound becomes:

This objective encourages the learned reverse transition to approximate the true posterior . Intuitively, training teaches the model to iteratively denoise samples, starting from pure Gaussian noise and progressively generating a realistic sample by reversing the diffusion process.

To make learning tractable, we design the reverse model to share the same functional form as the true posterior . The true reverse conditional can be shown to be Gaussian:

Since is unknown during generation, the model instead predicts from using a neural network . This allows us to parameterize the learned reverse transition as

Thus the model learns to estimate the original clean data from a noisy input , and then uses this estimate to compute the mean of the Gaussian denoising step.

Using this parameterization, the KL divergence term in the ELBO simplifies significantly. In particular, the KL term between the true reverse posterior and the learned reverse model becomes proportional to a mean squared error objective. Specifically,

This shows that training the diffusion model can be interpreted as predicting the original clean data from a noisy sample at timestep . Instead of directly minimizing KL divergences between distributions, the learning problem reduces to a simple regression objective where a neural network learns to remove noise from progressively corrupted inputs. This simplification is one of the key reasons diffusion models are stable and effective to train in practice.

5.3 Predicting Noise

A common and more effective parameterization is to have the neural network predict the noise directly rather than predicting . Recall that the forward process can be written as

where . From this expression we can rearrange to estimate if we know the noise:

Thus, instead of predicting , the model learns a neural network that predicts the noise added at timestep . Substituting this estimate into the reverse Gaussian mean gives the learned reverse transition

With this parameterization, the training objective simplifies to predicting the true noise that produced . The loss becomes a simple mean squared error between the true noise and the predicted noise:

This formulation makes diffusion models particularly stable to train: the network simply learns to identify and remove Gaussian noise at different noise levels, enabling the model to iteratively denoise samples from pure noise back to realistic data .

5.4 Training

To train the diffusion model, we can repeat the following until convergence:

Sample from the data distribution.
Let and .
Take a gradient descent step on .

5.5 Sampling

To sample from the trained model, do the following:

Take as input.
For :
1. Sample .
2. Let .
Return as the generated sample.

Choosing Hyperparameters

We must choose hyperparameters such as and , and the amount of steps needed for the generation process.

5.6 Architecture Design

Since our output dimension is the same as the input, we need to use an architecture. A common choice is a U-Net architecture, which has an encoder-decoder structure with skip connections between corresponding layers in the encoder and decoder. Constructing seperate neural networks for each timestep is computationally expensive. Instead, we can condition the same network on .

Vision transforms are also used, which split the image into patches, each as a token, and arranging them into a sequence. This allows the model to capture long-range dependencies and global context in the image, which is important for generating high-quality samples.

Both UNet and ViT are expensive on high resolution images. Instead, we can lift the diffusion process to a lower dimensional latent space. We can train a VAE to learn a latent representation of the data, then apply the diffusion process in this latent space. This is the basis of latent diffusion models, which can generate high-quality images with significantly reduced computational cost compared to pixel-space diffusion models.

5.7 LDMs vs VAEs

At generation time, the sampling distributions used by VAEs and Latent Diffusion Models (LDMs) differ. In a standard VAE, we sample the latent variable directly from the prior . In contrast, latent diffusion models approximately sample from the aggregated posterior .$, where the aggregated posterior is defined as:

This difference matters because of how the ELBO objective is optimized during VAE training. The ELBO is:

During training, the decoder only sees latent samples drawn from the encoder distribution . As a result, the decoder effectively learns to reconstruct data from latent codes distributed according to , not necessarily the prior .

However, at generation time in a VAE we still sample and can differ significantly from even after training. This mismatch can cause generated samples to be lower quality or unrealistic, because the decoder receives latent codes it was not well trained to decode.

Latent diffusion models mitigate this issue by learning a generative process that better matches the aggregated posterior distribution, reducing the mismatch between the latent codes used during training and those used during generation.

Back to Home

Table of Contents