Unsupervised Learning

Supervised learning assumes we have labeled data, but in many cases, labels are not available. Unsupervised learning techniques help us find patterns and structure in unlabeled data. We could, for example, group instances into clusters or estimate the probability distribution that generated the data.

1. Clustering

A cluster is a set of instances similar to each other but dissimilar to instances in other clusters. Clustering is the task of grouping instances (in some feature space) into such clusters.

1.1 K-Means Clustering

With K-Means clustering we try to split the data into $K$ clusters with the following steps:

  1. Initialisation: randomly select initial cluster centroids.
  2. Assignment: assign each data point to the nearest cluster centroid.
  3. Update: recalculate the centroids as the mean of all points assigned to each cluster.
  4. If we have not converged, repeat Assignment and Update steps.

More formally:

  1. Initialisation: randomly select $K$ initial centroids $\mu_1, \dots, \mu_K$.
  2. Assignment: assign each point $x^{(i)}$ to its nearest centroid, $c^{(i)} = \arg\min_k \lVert x^{(i)} - \mu_k \rVert^2$.
  3. Update: $\mu_k = \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)}$, where $C_k = \{ x^{(i)} : c^{(i)} = k \}$ is the set of points assigned to cluster $k$.
  4. Convergence check: stop when $\sum_{k=1}^{K} \lVert \mu_k^{\text{new}} - \mu_k^{\text{old}} \rVert < \epsilon$ for some small $\epsilon$; otherwise repeat steps 2 and 3.

K-Means aims to minimize the within-cluster sum of squares (WCSS), i.e. the sum of squared distances between each training example and its assigned cluster centroid:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{x^{(i)} \in C_k} \lVert x^{(i)} - \mu_k \rVert^2$$
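
As a concrete illustration, here is a minimal K-means sketch in NumPy; the function and variable names (kmeans, n_iters, tol) are my own choices, not from the original notes.

```python
import numpy as np

def kmeans(X, K, n_iters=100, tol=1e-6, seed=0):
    """Minimal K-means: X is (n, d); returns centroids (K, d) and assignments (n,)."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment: distance from every point to every centroid, then argmin.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Convergence check: stop when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```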

1.2 Finding K

The elbow method: plot the WCSS against different values of $K$ and pick the $K$ at the "elbow" of the curve, where adding more clusters stops giving a significant reduction in WCSS.
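
A rough elbow-method sketch, reusing the kmeans function from the sketch above together with an assumed wcss helper:

```python
import numpy as np

def wcss(X, centroids, labels):
    """Within-cluster sum of squares for a given clustering."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))

# Run K-means for a range of K and look for the "elbow" in the printed curve.
X = np.random.default_rng(0).normal(size=(300, 2))
for K in range(1, 9):
    centroids, labels = kmeans(X, K)
    print(K, wcss(X, centroids, labels))
```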

We could also use cross-validation: evaluate the clustering on held-out data and pick the $K$ that performs best.

1.3 K-Means Performance

K-means is simple and efficient: the cost is linear in the number of iterations, clusters, data points and dimensions, i.e. $O(t \cdot K \cdot n \cdot d)$. However, it only converges to a local optimum (so results depend on the initialisation), $K$ must be chosen in advance, and it tends to produce spherical clusters of similar size.

2. Probability Density Estimation

A probability density function (PDF) $p(x)$ models how likely $x$ is to be observed within a particular interval. We have $p(x) \ge 0$ and $\int p(x)\,dx = 1$. The goal of probability density estimation is to estimate the PDF $p(x)$ from data.

2.1 Non-Parametric Methods

These make no assumptions about the form of the data distribution, and the effective number of parameters grows with the amount of data. This gives low bias but high variance.

2.1.1 Histogram Estimation

Histograms are a simple way to estimate PDFs:

  1. Divide the data range into equal-width bins.
  2. Count the number of data points in each bin.
  3. Estimate the PDF in each bin as the bin count divided by (total number of points × bin width), so the estimate integrates to 1.
  4. The choice of bin width affects the estimate: too wide loses detail, too narrow introduces noise.
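
A minimal histogram density estimate using NumPy's np.histogram with density=True, which normalises the counts by the number of points and the bin width:

```python
import numpy as np

# 1D data drawn from some unknown distribution.
x = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=1000)

# Counts in equal-width bins, normalised so the histogram integrates to 1.
density, edges = np.histogram(x, bins=20, density=True)

# The PDF estimate is piecewise constant: density[i] on [edges[i], edges[i+1]).
for left, right, p in zip(edges[:-1], edges[1:], density):
    print(f"[{left:+.2f}, {right:+.2f}): {p:.3f}")
```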

2.1.2 Kernel Density Estimation

Another method is Kernel Density Estimation (KDE), where we compute $p(x)$ by looking at training examples falling within a window around $x$, defined by a kernel function $k(\cdot)$:

$$p(x) = \frac{1}{n h^d} \sum_{i=1}^{n} k\!\left(\frac{x - x^{(i)}}{h}\right)$$

where $n$ is the number of training examples, $h$ is the bandwidth (window size) and $d$ is the number of dimensions. A simple kernel function is the uniform kernel:

$$k(u) = \begin{cases} 1 & \text{if } |u_j| \le \tfrac{1}{2} \text{ for all } j \\ 0 & \text{otherwise} \end{cases}$$

Another common kernel is the Gaussian kernel:

$$k(u) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{\lVert u \rVert^2}{2}\right)$$

Increasing the bandwidth $h$ smooths the estimate, while decreasing it makes the estimate more sensitive to noise.
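
A minimal 1D KDE sketch with a Gaussian kernel; gaussian_kde_1d is an illustrative name, and SciPy's scipy.stats.gaussian_kde offers a ready-made alternative:

```python
import numpy as np

def gaussian_kde_1d(x_query, x_train, h):
    """Evaluate a 1D Gaussian-kernel density estimate at the query points."""
    n = len(x_train)
    # Scaled distances between every query point and every training point.
    u = (x_query[:, None] - x_train[None, :]) / h      # shape (m, n)
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)       # Gaussian kernel values
    return k.sum(axis=1) / (n * h)                     # p(x) at each query point

x_train = np.random.default_rng(0).normal(size=500)
x_query = np.linspace(-4, 4, 9)
print(gaussian_kde_1d(x_query, x_train, h=0.3))
```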

2.2 Parametric Methods

Parametric methods make simplifying assumptions about the form of the distribution and estimate a fixed number of parameters. In return, this gives high bias but low variance.

2.2.1 Univariate Gaussian Distribution

A common assumption is that the data follows a univariate Gaussian distribution:

$$p(x) = \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

In this case we can find $p(x)$ by fitting $\mu$ and $\sigma^2$ to the training set:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x^{(i)} - \mu\right)^2$$

2.2.2 Multivariate Gaussian Distribution

The Multivariate Gaussian distribution generalizes the univariate case to multiple dimensions:

$$p(x) = \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \lvert\Sigma\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

Then we can find $p(x)$ by fitting the mean vector $\mu$ and covariance matrix $\Sigma$ to the training set:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}, \qquad \Sigma = \frac{1}{n}\sum_{i=1}^{n} \left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^\top$$
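
These maximum-likelihood fits are one-liners in NumPy; the sketch below assumes full-covariance estimation and uses scipy.stats.multivariate_normal to evaluate the fitted density:

```python
import numpy as np
from scipy.stats import multivariate_normal

X = np.random.default_rng(0).normal(size=(1000, 2))   # (n, d) training data

mu = X.mean(axis=0)                          # mean vector, shape (d,)
Sigma = np.cov(X, rowvar=False, bias=True)   # ML covariance (divides by n), shape (d, d)

# Evaluate the fitted density at a query point.
print(multivariate_normal(mean=mu, cov=Sigma).pdf([0.0, 0.0]))
```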

2.2.3 Mixture Models

A Mixture Model gives us a better bias-variance tradeoff by combining multiple distributions:

$$p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x), \qquad \text{with } \pi_k \ge 0 \text{ and } \sum_{k=1}^{K} \pi_k = 1$$

A Gaussian Mixture Model (GMM) is most popular:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
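
Evaluating a GMM density is just a weighted sum of Gaussian densities; a small sketch with made-up parameter values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, pis, mus, Sigmas):
    """Density of a Gaussian mixture at point x."""
    return sum(pi * multivariate_normal(mean=mu, cov=S).pdf(x)
               for pi, mu, S in zip(pis, mus, Sigmas))

# A toy 2-component mixture in 2D.
pis = [0.6, 0.4]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
```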

2.3 Likelihood

Likelihood quantifies how well a model fits the data by measuring the probability of observing the dataset $X = \{x^{(1)}, \dots, x^{(n)}\}$ under the model (assuming i.i.d. samples):

$$L(\theta) = p(X \mid \theta) = \prod_{i=1}^{n} p(x^{(i)} \mid \theta)$$

where $\theta$ are the model parameters. We often use the negative log-likelihood for optimisation, turning it into a minimisation problem:

$$\mathrm{NLL}(\theta) = -\log L(\theta) = -\sum_{i=1}^{n} \log p(x^{(i)} \mid \theta)$$

When we perform Gaussian fitting, we are actually minimising the negative log-likelihood! This can be proven by setting $\frac{\partial \mathrm{NLL}}{\partial \mu} = 0$ and $\frac{\partial \mathrm{NLL}}{\partial \sigma^2} = 0$ and solving for the minima, which yields exactly the fitting formulas above.
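
For example, differentiating the univariate Gaussian NLL with respect to $\mu$ and setting it to zero recovers the sample-mean estimate:

$$\frac{\partial \mathrm{NLL}}{\partial \mu} = \frac{\partial}{\partial \mu} \sum_{i=1}^{n} \left[ \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{(x^{(i)} - \mu)^2}{2\sigma^2} \right] = -\frac{1}{\sigma^2} \sum_{i=1}^{n} \left(x^{(i)} - \mu\right) = 0 \quad\Rightarrow\quad \mu = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}$$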

3. GMM-EM Algorithm

GMMs can model complex distributions but have lots of parameters to optimise: $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, i.e. a mixture weight, mean vector and covariance matrix for each of the $K$ components.

The GMM-EM algorithm fits a Gaussian Mixture Model to data using the Expectation-Maximization (EM) algorithm:

  1. Initialisation: choose a value $K$ for the number of components, and randomly initialise the parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.
  2. Expectation: compute the responsibilities $r_{ik}$ for each data point $i$ and component $k$:

     $$r_{ik} = \frac{\pi_k\, \mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)}$$

  3. Maximization: for each component $k$, update the mean $\mu_k$, covariance $\Sigma_k$, and mixture proportion $\pi_k$:

     $$n_k = \sum_{i=1}^{n} r_{ik}, \quad \mu_k = \frac{1}{n_k}\sum_{i=1}^{n} r_{ik}\, x^{(i)}, \quad \Sigma_k = \frac{1}{n_k}\sum_{i=1}^{n} r_{ik} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^\top, \quad \pi_k = \frac{n_k}{n}$$

  4. If not converged (no significant change in $\theta$ or in the log-likelihood), repeat the Expectation and Maximization steps.

Just like K-means, GMM-EM converges to local optima.
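
A compact EM sketch following the steps above; the names are illustrative, and in practice a library implementation such as scikit-learn's GaussianMixture would normally be used:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, tol=1e-6, seed=0):
    """Fit a K-component GMM to X of shape (n, d) with Expectation-Maximization."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialisation: random means from the data, identity covariances, uniform weights.
    mus = X[rng.choice(n, size=K, replace=False)]
    Sigmas = np.array([np.eye(d)] * K)
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, Sigma_k).
        dens = np.column_stack([
            pis[k] * multivariate_normal(mean=mus[k], cov=Sigmas[k]).pdf(X)
            for k in range(K)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances from the responsibilities.
        nk = r.sum(axis=0)
        pis = nk / n
        mus = (r.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            # Small ridge added for numerical stability.
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
        # Convergence check on the change in log-likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas
```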

3.1 Choosing K

To find $K$ we commonly minimise the Bayesian Information Criterion (BIC), which takes into account both the negative log-likelihood and the model complexity (Occam's razor):

$$\text{BIC} = -2 \log L(\hat{\theta}) + P \log n$$

where $P$ is the number of model parameters and $n$ is the number of data points. For a GMM of 2D Gaussians, each component contributes 2 mean parameters, 3 covariance parameters and a mixture weight (with one degree of freedom removed overall because the weights sum to 1).
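
A sketch of computing the BIC from a fitted model's log-likelihood; the helper names here are hypothetical (scikit-learn's GaussianMixture also provides a bic() method):

```python
import numpy as np

def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_points)

def gmm_n_params(K, d):
    """Free parameters of a K-component full-covariance GMM in d dimensions."""
    return K * (d + d * (d + 1) // 2) + (K - 1)   # means + covariances + weights

# e.g. fit gmm_em(X, K) for K = 1..8, compute each model's log-likelihood,
# and keep the K with the lowest BIC.
```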

Once again, we could also use cross-validation to choose $K$: fit the GMM on a training split and pick the $K$ that gives the highest log-likelihood on held-out data.

3.2 GMM vs K-Means

In both cases we must choose $K$ in advance, the result depends on the random initialisation, and the algorithm alternates between an assignment (expectation) step and an update (maximization) step, converging only to a local optimum.

Often K-means is run first to initialise GMM-EM.

GMM-EM does soft clustering: the responsibilities encode to what degree each data point belongs to each cluster. K-means does hard clustering: each data point is assigned to exactly one cluster.

GMM-EM can also model components with different mixture weights and non-spherical (elliptical) clusters, since each component has its own covariance matrix.
