Unsupervised Learning
Supervised learning assumes we have labeled data, but in many cases, labels are not available. Unsupervised learning techniques help us find patterns and structures in unlabeled data. We could:
- Try to determine groups of similar data points (clustering).
- Estimate the underlying probability distribution of the data (density estimation).
- Reduce the dimensionality of the data while preserving important features (dimensionality reduction).
1. Clustering
A cluster is a set of instances similar to each other but dissimilar to instances in other clusters. Clustering is grouping instances (in some feature space) into clusters.
1.1 K-Means Clustering
With K-Means Clustering we try to split the data into $K$ clusters:
- Initialisation: randomly select $K$ initial cluster centroids.
- Assignment: assign each data point to the nearest cluster centroid.
- Update: recalculate the centroids as the mean of all points assigned to each cluster.
- If we have not converged, repeat the Assignment and Update steps.
More formally:
- Randomly select $K$ initial centroids $\mu_1, \dots, \mu_K$.
- Assignment: $c^{(i)} = \arg\min_{k} \left\| x^{(i)} - \mu_k \right\|^2$.
- Update: $\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}$, where $C_k = \{\, i : c^{(i)} = k \,\}$.
- Convergence check: $\sum_{k=1}^{K} \left\| \mu_k^{(t)} - \mu_k^{(t-1)} \right\| < \epsilon$ for some small $\epsilon > 0$.
K-Means aims to minimize the within-cluster sum of squares (WCSS). This is the average squared distance between each training example and its assigned cluster centroid:

$$J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
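A minimal NumPy sketch of the algorithm above (illustrative only; the function name `kmeans` and the arguments `X`, `K`, `n_iters`, and `tol` are my own choices, not from the notes):

```python
import numpy as np

def kmeans(X, K, n_iters=100, tol=1e-6, seed=0):
    """Minimal K-means on an (n, d) data array X with K clusters."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: mean of the points assigned to each cluster.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Convergence check: stop when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Final assignment and cost J (average squared distance to the assigned centroid).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    cost = np.mean(dists[np.arange(len(X)), labels] ** 2)
    return centroids, labels, cost
```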
1.2 Finding K
The elbow method:
- Run $K$-means multiple times with different $K$.
- Keep track of the cost $J$ for each $K$.
- Plot $J$ against $K$ and look for the "elbow" point where the decrease in cost starts to slow down. This point indicates a good choice for $K$ (see the sketch below).
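A possible sketch of this procedure with scikit-learn, whose `KMeans.inertia_` attribute stores the WCSS (the data `X` here is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

ks = range(1, 11)
costs = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(model.inertia_)  # within-cluster sum of squares for this K

plt.plot(list(ks), costs, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()
```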
We could also use cross validation to choose $K$.
1.3 K-Means Performance
K-means is simple and efficient (linear complexity in the number of data points), but it has several limitations:
- $K$ must be specified in advance.
- K-means finds a local optimum (sensitive to initial centroids).
- Only works when a distance function exists.
- Very sensitive to outliers.
- Does not handle hyper-ellipsoidal clusters (it assumes roughly hyper-spherical clusters).
2. Probability Density Estimation
A probability density function (PDF) $p(x)$ describes the relative likelihood of each value of $x$; it satisfies $p(x) \ge 0$ and $\int p(x)\,dx = 1$. Density estimation is the task of estimating $p(x)$ from a sample of data.
2.1 Non-Parametric Methods
These don't make assumptions about the data distribution. The number of parameters grows with the data. This gives low bias but high variance.
2.1.1 Histogram Estimation
Histograms are a simple way to estimate PDFs:
- Divide the data range into $M$ equal-width bins.
- Count the number of data points in each bin.
- Estimate the PDF as the normalized counts in each bin.
- The choice of bin width affects the estimate: too wide loses detail, too narrow introduces noise.
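As a rough illustration, `numpy.histogram` with `density=True` returns exactly this normalised-count estimate (the sample and the bin count are placeholders):

```python
import numpy as np

x = np.random.default_rng(1).normal(size=1000)  # placeholder sample

# density=True divides the counts by (n * bin_width), so the estimate integrates to 1.
density, bin_edges = np.histogram(x, bins=30, density=True)

bin_width = bin_edges[1] - bin_edges[0]
print("integral of the estimate:", density.sum() * bin_width)  # ~1.0
```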
2.1.2 Kernel Density Estimation
Another method is Kernel Density Estimation (KDE), where we compute

$$\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x^{(i)}}{h}\right)$$

where $K(\cdot)$ is a kernel function and $h$ is the bandwidth. A simple choice is the box (Parzen window) kernel, $K(u) = 1$ if $|u| \le \tfrac{1}{2}$ and $0$ otherwise.

Another common kernel is the Gaussian kernel:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$$

Increasing the bandwidth $h$ smooths the estimate (higher bias, lower variance), while decreasing it produces a noisier, more detailed estimate (lower bias, higher variance).
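A small sketch of a one-dimensional Gaussian-kernel KDE following the formula above (the bandwidth and evaluation grid are illustrative choices):

```python
import numpy as np

def gaussian_kde_1d(x_query, data, h):
    """Evaluate (1/(n*h)) * sum_i K((x - x_i)/h) with a Gaussian kernel."""
    u = (x_query[:, None] - data[None, :]) / h      # shape (n_query, n_data)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel values
    return K.sum(axis=1) / (len(data) * h)

data = np.random.default_rng(2).normal(size=500)    # placeholder sample
grid = np.linspace(-4, 4, 200)
estimate = gaussian_kde_1d(grid, data, h=0.3)       # smaller h gives a noisier curve
```

For real use, `scipy.stats.gaussian_kde` implements the same idea with automatic bandwidth selection.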
2.2 Parametric Methods
Parametric methods make simplifying assumptions about the form of the distribution and estimate a fixed number of parameters. This gives high bias but low variance.
2.2.1 Univariate Gaussian Distribution
A common assumption is that the data follows a Univariate Gaussian distribution:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

In this case we can find the maximum likelihood estimates of the parameters directly from the data:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x^{(i)} - \hat{\mu}\right)^2$$
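In code this is just the sample mean and the biased sample variance; a quick sketch with placeholder data:

```python
import numpy as np

x = np.random.default_rng(3).normal(loc=2.0, scale=1.5, size=1000)  # placeholder sample

mu_hat = x.mean()
sigma2_hat = x.var()  # NumPy's default ddof=0 divides by n, matching the MLE
print(mu_hat, np.sqrt(sigma2_hat))
```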
2.2.2 Multivariate Gaussian Distribution
The Multivariate Gaussian distribution generalizes the univariate case to multiple dimensions:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

Then we can find the maximum likelihood estimates of the mean vector and covariance matrix:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}^{(i)}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{i=1}^{n} \left(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}\right)\left(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}\right)^\top$$
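Similarly for the multivariate case (a sketch; note that `np.cov` divides by n-1 by default, so `bias=True` is needed to match the MLE):

```python
import numpy as np

X = np.random.default_rng(4).normal(size=(500, 3))  # placeholder (n, d) data

mu_hat = X.mean(axis=0)                          # mean vector, shape (d,)
Sigma_hat = np.cov(X, rowvar=False, bias=True)   # (d, d) covariance, divided by n
```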
2.2.3 Mixture Models
A Mixture Model gives us a better bias-variance tradeoff by combining multiple distributions:

$$p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$

A Gaussian Mixture Model (GMM) is most popular:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right)$$
2.3 Likelihood
Likelihood quantifies how well a model fits the data, by measuring the probability of observing the data under the model parameters $\theta$:

$$L(\theta) = \prod_{i=1}^{n} p\!\left(x^{(i)} \mid \theta\right), \qquad \log L(\theta) = \sum_{i=1}^{n} \log p\!\left(x^{(i)} \mid \theta\right)$$

where $\theta$ denotes the model parameters (e.g. $\mu$ and $\sigma^2$ for a Gaussian).

When we perform Gaussian fitting, we are actually maximising the log likelihood (equivalently, minimising the negative log likelihood)! This can be proven by setting $\frac{\partial \log L}{\partial \mu} = 0$ and $\frac{\partial \log L}{\partial \sigma^2} = 0$ and solving, which recovers the sample mean and variance.
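As a short sketch of that argument for the mean (standard maximum-likelihood working, written out here for completeness):

$$\frac{\partial \log L}{\partial \mu} = \frac{\partial}{\partial \mu} \sum_{i=1}^{n} \left[ -\frac{\left(x^{(i)} - \mu\right)^2}{2\sigma^2} - \frac{1}{2}\log\!\left(2\pi\sigma^2\right) \right] = \sum_{i=1}^{n} \frac{x^{(i)} - \mu}{\sigma^2} = 0 \;\;\Rightarrow\;\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}$$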
3. GMM-EM Algorithm
GMMs can model complex distributions but have lots of parameters to optimise: $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.
The GMM-EM algorithm fits a Gaussian Mixture Model to data using the Expectation-Maximization (EM) algorithm:
- Initialisation: Choose a $K$ for the number of components, and randomly initialise the parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}$.
- Expectation: Compute the responsibilities for each data point $i$ and component $k$:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}\!\left(x^{(i)} \mid \mu_k, \Sigma_k\right)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}\!\left(x^{(i)} \mid \mu_j, \Sigma_j\right)}$$

- Maximization: For each component, update the means $\mu_k$, covariances $\Sigma_k$, and mixture proportions $\pi_k$:

$$N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik}\, x^{(i)}, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^\top, \qquad \pi_k = \frac{N_k}{n}$$

- If not converged (no significant change in $\theta$ or the log-likelihood), repeat the Expectation and Maximization steps.
Just like K-means, GMM-EM converges to local optima.
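A compact NumPy/SciPy sketch of these steps (illustrative only; the initialisation and the small covariance ridge are my own simplifications):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, tol=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialisation: random data points as means, shared covariance, uniform weights.
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k].
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixture proportions, means, and covariances.
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(d)  # small ridge for numerical stability
        # Convergence check on the log-likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```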
3.1 Choosing K
To find $K$, we can use an information criterion that penalises model complexity, such as the Akaike or Bayesian Information Criterion:

$$\mathrm{AIC} = 2p - 2\ln\hat{L}, \qquad \mathrm{BIC} = p\ln n - 2\ln\hat{L}$$

where $p$ is the number of free parameters, $n$ is the number of data points, and $\hat{L}$ is the maximised likelihood (lower values are better).
Once again, we could also use cross validation to choose $K$:
- Split into training and validation sets.
- Fit GMM-EM on the training set for different $K$.
- Evaluate the log-likelihood on the validation set, and pick the $K$ with the highest validation log-likelihood (see the sketch below).
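A possible sketch of this with scikit-learn's `GaussianMixture`, whose `score` method returns the mean per-sample log-likelihood (the data and the candidate range of $K$ are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X = np.random.default_rng(5).normal(size=(1000, 2))  # placeholder data
X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

best_k, best_ll = None, -np.inf
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    ll = gmm.score(X_val)  # mean log-likelihood per validation sample
    if ll > best_ll:
        best_k, best_ll = k, ll

print("chosen K:", best_k)
```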
3.2 GMM vs K-Means
In both cases:
- We specify $K$ clusters/components.
- Convergence happens when assignments/parameters stabilise.
- Sensitive to initialisation.
Often K-means is run first to initialise GMM-EM.
GMM-EM does soft clustering (the responsibilities encode to what degree each data point belongs to each cluster), while K-means does hard clustering (each data point is assigned to exactly one cluster).

GMM-EM can also produce components with different mixture proportions and non-spherical (full-covariance) cluster shapes.
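To illustrate the soft/hard distinction with scikit-learn (a sketch; the data `X` and the choice of three clusters are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(6).normal(size=(300, 2))  # placeholder data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(km.predict(X[:3]))          # hard assignments: one cluster index per point
print(gmm.predict_proba(X[:3]))   # soft assignments: a probability per component
```

Incidentally, scikit-learn's `GaussianMixture` initialises with K-means by default, which matches the note above about running K-means first.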