Evaluation
1. Hyperparameter Tuning
To ensure meaningful evaluation, we split a dataset into a training set and a test set. The test set is not given to the model during training, only during evaluation. Both sets are shuffled so that they are unordered. Our objective is to tune hyperparameters (settings chosen before training, as opposed to the parameters the model learns) to get the best performance.
Do NOT evaluate different hyperparameters on the same test set, as this leads to overfitting on the test set (which is effectively the same as using it as a training set). A better approach is to split into training, validation and test sets. The validation set is used to evaluate hyperparameters, and the test set is only used at the end to get a final performance measure. This is called hyperparameter tuning.
Once we have done our hyperparameter tuning, we can retrain the model on the combined training and validation sets to get the best model possible. We then evaluate this on the test set.
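A minimal sketch of such a split, assuming scikit-learn is available; the synthetic dataset, the 60/20/20 proportions and the random seeds are illustrative choices, not values from the notes:

```python
# Train/validation/test split sketch (illustrative sizes and seeds).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% of the data as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# Carve a validation set out of the remaining 80% (0.25 * 0.8 = 0.2 of the data).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, shuffle=True, random_state=0)

# Tune hyperparameters by training on (X_train, y_train) and scoring on (X_val, y_val);
# use (X_test, y_test) only once, for the final performance estimate.
```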
2. Cross Validation
When data is limited, extracting three separate datasets may be wasteful. Instead, the dataset is divided into $k$ folds: each fold is used once as the test set while the remaining $k-1$ folds are used for training, and the $k$ scores are averaged.
We can't determine the performance of a specific model from this (there are $k$ different models, one trained per fold); instead, cross validation estimates how well the learning procedure generalises.
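A minimal sketch of k-fold cross validation, assuming scikit-learn; the synthetic dataset, the logistic regression model and $k = 5$ are illustrative choices:

```python
# k-fold cross validation sketch (illustrative data, model and k).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# One accuracy score per fold; their mean estimates how well the procedure generalises.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```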
2.1 Parameter Tuning
Option 1:
- Use one fold for testing, one fold for validation, and $k-2$ folds for training.
- Finds an optimal set of hyperparameters per fold.
Option 2:
- Use one fold for testing.
- The remaining $k-1$ folds run an internal cross validation to find the optimal hyperparameters. (Expensive!)
- Each fold still has its own hyperparameters, but they are chosen based on more data.
Once everything is chosen and evaluated, we can combine all datasets into one and retrain the model with the optimal hyperparameters. However, we cannot be sure this is the best model as we have not tested it on unseen data.
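A sketch of Option 2 (nested cross validation), assuming scikit-learn; the SVC model, the parameter grid and the fold counts are illustrative assumptions:

```python
# Nested cross validation sketch (illustrative model, grid and fold counts).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner 3-fold cross validation chooses C separately within each outer training split.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer 5-fold cross validation evaluates the whole tuning procedure on held-out folds.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```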
3. Evaluation Metrics
A confusion matrix is a table that summarises the performance of a classification model. For binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Confusion Matrix Example
3.1 Accuracy
Accuracy is given by
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
3.2 Precision
Precision is given by
$$\text{Precision} = \frac{TP}{TP + FP}$$
3.3 Recall
Recall is given by
$$\text{Recall} = \frac{TP}{TP + FN}$$
Usually, there is a trade-off between precision and recall (either positive examples are correctly recognised with many false positives, or we miss many positive examples but have few false positives).
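A small sketch computing these metrics from binary confusion-matrix counts; the counts are made-up numbers for illustration:

```python
# Accuracy, precision and recall from made-up confusion-matrix counts.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision = TP / (TP + FP)                   # 0.80
recall = TP / (TP + FN)                      # ~0.89
print(accuracy, precision, recall)
```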
3.4 Macro / Micro Averaging
- Macro Averaging precision or recall is when we calculate precision/recall for each class, then average them. This treats all classes equally.
- Micro Averaging precision or recall is when we sum up the TP, FP, FN across all classes, then calculate precision/recall. This treats all examples equally. For binary classification, micro and macro averaging are the same.
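A sketch contrasting the two averages on made-up per-class counts, where one class dominates: micro averaging (which weights every example equally) is pulled towards the dominant class's score, while macro averaging treats both classes equally.

```python
# Macro vs micro averaged precision on made-up per-class counts.
counts = {"A": {"TP": 5,  "FP": 5},
          "B": {"TP": 90, "FP": 0}}

macro = sum(c["TP"] / (c["TP"] + c["FP"]) for c in counts.values()) / len(counts)
micro = sum(c["TP"] for c in counts.values()) / sum(c["TP"] + c["FP"] for c in counts.values())
print(macro, micro)   # 0.75 vs 0.95
```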
3.5 F-Measure / F-Score
This combines precision and recall into a single metric:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
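A tiny worked example, reusing the made-up counts from the earlier sketch (TP = 40, FP = 10, FN = 5):

```python
# F1 from precision and recall (made-up counts from the earlier sketch).
precision, recall = 40 / 50, 40 / 45
f1 = 2 * precision * recall / (precision + recall)
print(f1)   # ~0.84
```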
3.6 Multiple Classes
We define one class as being positive and the rest negative. Treating C1 as the positive class, a confusion matrix for multiple classes looks like:

| | Predicted C1 | Predicted C2 | Predicted C3 | Predicted C4 |
|---|---|---|---|---|
| Actual C1 | TP | FN | FN | FN |
| Actual C2 | FP | TN | TN | TN |
| Actual C3 | FP | TN | TN | TN |
| Actual C4 | FP | TN | TN | TN |
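A sketch of reading these one-vs-rest counts for every class off a multi-class confusion matrix (rows are actual classes, columns are predicted classes); the matrix entries are made up. These per-class counts are what macro and micro averaging (3.4) are computed from.

```python
# One-vs-rest TP/FP/FN/TN for every class from a made-up multi-class confusion matrix.
import numpy as np

cm = np.array([[10, 1, 0,  1],
               [ 2, 8, 1,  0],
               [ 0, 0, 9,  2],
               [ 1, 1, 0, 12]])

TP = np.diag(cm)
FP = cm.sum(axis=0) - TP          # predicted as this class, but actually another
FN = cm.sum(axis=1) - TP          # actually this class, but predicted as another
TN = cm.sum() - TP - FP - FN      # everything not involving this class
print(TP, FP, FN, TN)
```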
3.7 Mean Squared Error
MSE is used for regression tasks. It is defined as
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the true value and $\hat{y}_i$ is the model's prediction for example $i$. Root mean squared error is defined as
$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
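A minimal sketch on made-up regression targets and predictions:

```python
# MSE and RMSE on made-up targets and predictions.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # 0.375
rmse = np.sqrt(mse)                     # ~0.61
print(mse, rmse)
```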
3.8 Picking Metrics
We want our models to be accurate, fast, scalable, simple and interpretable. It's not just about accuracy!
3.9 Imbalanced Data Distribution
If our data is imbalanced between classes:
- Accuracy is misleading, as it follows the majority class.
- Macro-averaged recall can detect if one class is completely misclassified.
- F1 is useful, but can also be affected by class imbalance.
To fix this, we could normalise the rows in the confusion matrix. Alternatively, we could upsample or downsample the data to balance the classes.
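A small sketch of row-normalising a made-up, imbalanced confusion matrix so each row shows the fraction of that class's examples given each prediction:

```python
# Row-normalised confusion matrix on a made-up imbalanced example.
import numpy as np

cm = np.array([[950, 50],    # majority class: 1000 examples
               [ 30, 20]])   # minority class: 50 examples

normalised = cm / cm.sum(axis=1, keepdims=True)
print(normalised)   # rows sum to 1; the minority class's 40% recall is now obvious
```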
4. Overfitting
- Overfitting: good performance on training data, poor on unseen data.
- Underfitting: poor performance on both training and unseen data.
Overfitting occurs when the model is too complex, the training set is not representative, or learning is performed for too long.
5. Confidence Intervals
A model's true error is the probability that it will misclassify a random sample drawn from the underlying data distribution. We can only measure its sample error: the fraction of a finite test sample that it misclassifies.
Given a sample of $n$ test examples on which the model has sample error $e_S$, and assuming $n$ is reasonably large, an approximate $N\%$ confidence interval for the true error is
$$e_S \pm z_N \sqrt{\frac{e_S(1 - e_S)}{n}}$$
where $z_N$ is the standard normal critical value for the chosen confidence level (e.g. $z_N = 1.96$ for 95%).
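A small sketch of this interval on made-up numbers: 20 misclassifications on a test sample of 100 examples, at 95% confidence.

```python
# Approximate 95% confidence interval for the true error (made-up counts).
import math

n, errors = 100, 20
e_s = errors / n                              # sample error
z = 1.96                                      # z value for a 95% interval
half_width = z * math.sqrt(e_s * (1 - e_s) / n)
print(e_s - half_width, e_s + half_width)     # roughly (0.12, 0.28)
```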
6. Statistical Significance
Statistical tests tell us if the means of two sets are significantly different:
- Randomisation test: Randomly switch predictions between two models, calculate difference in accuracy. Repeat many times to get distribution of differences. See how extreme the actual difference is.
- Two Sample T-test: Estimate likelihood that the two metrics from different populations are actually different.
- Paired T-test: Estimate significance over multiple matched results.
To do this, we define a null hypothesis (the algorithms perform the same, and any observed performance difference is due to chance). The test returns a p-value: the probability of observing a difference at least as extreme as ours, assuming the null hypothesis is true. A small p-value (e.g. $p < 0.05$) means the observed difference would be unlikely under the null hypothesis, so we reject it and treat the difference as statistically significant.
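A sketch of the randomisation test described above, on made-up predictions: swap the two models' predictions per example at random, recompute the accuracy difference, and see how extreme the observed difference is. The swap probability of 0.5 and the 10,000 repetitions are illustrative choices.

```python
# Randomisation (permutation) test on made-up paired predictions.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
preds_a = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # ~80% accurate
preds_b = np.where(rng.random(200) < 0.7, y_true, 1 - y_true)   # ~70% accurate

observed = np.mean(preds_a == y_true) - np.mean(preds_b == y_true)

diffs = []
for _ in range(10_000):
    swap = rng.random(200) < 0.5                # swap each prediction with probability 0.5
    a = np.where(swap, preds_b, preds_a)
    b = np.where(swap, preds_a, preds_b)
    diffs.append(np.mean(a == y_true) - np.mean(b == y_true))

p_value = np.mean(np.abs(diffs) >= abs(observed))   # two-sided p-value
print(observed, p_value)
```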
6.1 P-Hacking
P-hacking is the misuse of data to find patterns that can be presented as statistically significant, when in reality they are not. To combat this, we can:
- Rank the p-values from the $m$ different experiments: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$.
- Calculate the Benjamini-Hochberg critical value for each: $\frac{i}{m} Q$, where $i$ is the p-value's rank and $Q$ is the chosen false discovery rate.
- Significant results are those where $p_{(i)} \leq \frac{i}{m} Q$ (every result up to the largest rank that satisfies this).
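A sketch of the procedure on made-up p-values, with a false discovery rate of $Q = 0.05$ as an illustrative choice:

```python
# Benjamini-Hochberg procedure on made-up p-values.
Q = 0.05
p_values = sorted([0.001, 0.008, 0.028, 0.041, 0.20])
m = len(p_values)

# Find the largest rank i whose p-value is at most its critical value (i/m) * Q;
# every p-value up to that rank is declared significant.
last_significant = 0
for i, p in enumerate(p_values, start=1):
    if p <= (i / m) * Q:
        last_significant = i

print(p_values[:last_significant])   # [0.001, 0.008, 0.028]
```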