Evaluation
1. Hyperparameter Tuning
To ensure meaningful evaluation, we split a dataset into a training set and a test set. The test set is not given to the model during training, only during evaluation. Both sets are shuffled so that they are unordered. Our objective is to tune hyperparameters (settings chosen before training, as opposed to the parameters the model learns) to get the best performance.
Do NOT evaluate different hyperparameters on the same test set, as this leads to overfitting on the test set (which is effectively the same as using it as a training set). A better approach is to split into training, validation and test sets. The validation set is used to evaluate hyperparameters, and the test set is only used at the end to get a final performance measure. This is called hyperparameter tuning.
Once we have done our hyperparameter tuning, we can retrain the model on the combined training and validation sets to get the best model possible. We then evaluate this on the test set.
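A minimal sketch of such a split, assuming scikit-learn is available; the synthetic dataset, the 60/20/20 proportions and the random seeds are illustrative choices, not values from the notes:

```python
# Train/validation/test split sketch (illustrative sizes and seeds).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% of the data as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# Carve a validation set out of the remaining 80% (0.25 * 0.8 = 0.2 of the data).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, shuffle=True, random_state=0)

# Tune hyperparameters by training on (X_train, y_train) and scoring on (X_val, y_val);
# use (X_test, y_test) only once, for the final performance estimate.
```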
2. Cross Validation
When data is limited, extracting three separate datasets may be wasteful. Instead, the dataset is divided into $k$ folds: each fold is used once as the test set while the remaining $k-1$ folds are used for training, and the $k$ scores are averaged.
We can't determine the performance of a specific model from this (there are $k$ different models, one trained per fold); instead, cross validation estimates how well the learning procedure generalises.
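A minimal sketch of k-fold cross validation, assuming scikit-learn; the synthetic dataset, the logistic regression model and $k = 5$ are illustrative choices:

```python
# k-fold cross validation sketch (illustrative data, model and k).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# One accuracy score per fold; their mean estimates how well the procedure generalises.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```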
2.1 Parameter Tuning
Option 1:
- Use one fold for testing, one fold for validation, and $k-2$ folds for training.
- Finds an optimal set of hyperparameters per fold.
Option 2:
- Use one fold for testing.
- The remaining $k-1$ folds run an internal cross validation to find the optimal hyperparameters. (Expensive!)
- Each fold still has its own hyperparameters, but they are chosen based on more data.
Once everything is chosen and evaluated, we can combine all datasets into one and retrain the model with the optimal hyperparameters. However, we cannot be sure this is the best model as we have not tested it on unseen data.
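A sketch of Option 2 (nested cross validation), assuming scikit-learn; the SVC model, the parameter grid and the fold counts are illustrative assumptions:

```python
# Nested cross validation sketch (illustrative model, grid and fold counts).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner 3-fold cross validation chooses C separately within each outer training split.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer 5-fold cross validation evaluates the whole tuning procedure on held-out folds.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```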
3. Evaluation Metrics
A confusion matrix is a table that summarises the performance of a classification model. For binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Confusion Matrix Example
3.1 Accuracy
Accuracy is given by
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
3.2 Precision
Precision is given by
$$\text{Precision} = \frac{TP}{TP + FP}$$
3.3 Recall
Recall is given by
$$\text{Recall} = \frac{TP}{TP + FN}$$
Usually, there is a trade-off between precision and recall (either positive examples are correctly recognised with many false positives, or we miss many positive examples but have few false positives).
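A small sketch computing these metrics from binary confusion-matrix counts; the counts are made-up numbers for illustration:

```python
# Accuracy, precision and recall from made-up confusion-matrix counts.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision = TP / (TP + FP)                   # 0.80
recall = TP / (TP + FN)                      # ~0.89
print(accuracy, precision, recall)
```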
3.4 Macro / Micro Averaging
- Macro Averaging precision or recall is when we calculate precision/recall for each class, then average them. This treats all classes equally.
- Micro Averaging precision or recall is when we sum up the TP, FP, FN across all classes, then calculate precision/recall. This treats all examples equally. For binary classification, micro and macro averaging are the same.
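A sketch contrasting the two averages on made-up per-class counts, where one class dominates: micro averaging (which weights every example equally) is pulled towards the dominant class's score, while macro averaging treats both classes equally.

```python
# Macro vs micro averaged precision on made-up per-class counts.
counts = {"A": {"TP": 5,  "FP": 5},
          "B": {"TP": 90, "FP": 0}}

macro = sum(c["TP"] / (c["TP"] + c["FP"]) for c in counts.values()) / len(counts)
micro = sum(c["TP"] for c in counts.values()) / sum(c["TP"] + c["FP"] for c in counts.values())
print(macro, micro)   # 0.75 vs 0.95
```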
3.5 F-Measure / F-Score
This combines precision and recall into a single metric:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
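A tiny worked example, reusing the made-up counts from the earlier sketch (TP = 40, FP = 10, FN = 5):

```python
# F1 from precision and recall (made-up counts from the earlier sketch).
precision, recall = 40 / 50, 40 / 45
f1 = 2 * precision * recall / (precision + recall)
print(f1)   # ~0.84
```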
3.6 Multiple Classes
We define one class as being positive and the rest negative. Treating C1 as the positive class, a confusion matrix for multiple classes looks like:

| | Predicted C1 | Predicted C2 | Predicted C3 | Predicted C4 |
|---|---|---|---|---|
| Actual C1 | TP | FN | FN | FN |
| Actual C2 | FP | TN | TN | TN |
| Actual C3 | FP | TN | TN | TN |
| Actual C4 | FP | TN | TN | TN |
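A sketch of reading these one-vs-rest counts for every class off a multi-class confusion matrix (rows are actual classes, columns are predicted classes); the matrix entries are made up. These per-class counts are what macro and micro averaging (3.4) are computed from.

```python
# One-vs-rest TP/FP/FN/TN for every class from a made-up multi-class confusion matrix.
import numpy as np

cm = np.array([[10, 1, 0,  1],
               [ 2, 8, 1,  0],
               [ 0, 0, 9,  2],
               [ 1, 1, 0, 12]])

TP = np.diag(cm)
FP = cm.sum(axis=0) - TP          # predicted as this class, but actually another
FN = cm.sum(axis=1) - TP          # actually this class, but predicted as another
TN = cm.sum() - TP - FP - FN      # everything not involving this class
print(TP, FP, FN, TN)
```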
3.7 Mean Squared Error
MSE is used for regression tasks. It is defined as
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the true value and $\hat{y}_i$ is the model's prediction for example $i$. Root mean squared error is defined as
$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
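A minimal sketch on made-up regression targets and predictions:

```python
# MSE and RMSE on made-up targets and predictions.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # 0.375
rmse = np.sqrt(mse)                     # ~0.61
print(mse, rmse)
```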
3.8 Picking Metrics
We want our models to be accurate, fast, scalable, simple and interpretable. It's not just about accuracy!
3.9 Imbalanced Data Distribution
If our data is imbalanced between classes:
- Accuracy is misleading, as it follows the majority class.
- Macro-averaged recall can detect if one class is completely misclassified.
- F1 is useful, but can also be affected by class imbalance.
To fix this, we could normalise the rows in the confusion matrix. Alternatively, we could upsample or downsample the data to balance the classes.
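A small sketch of row-normalising a made-up, imbalanced confusion matrix so each row shows the fraction of that class's examples given each prediction:

```python
# Row-normalised confusion matrix on a made-up imbalanced example.
import numpy as np

cm = np.array([[950, 50],    # majority class: 1000 examples
               [ 30, 20]])   # minority class: 50 examples

normalised = cm / cm.sum(axis=1, keepdims=True)
print(normalised)   # rows sum to 1; the minority class's 40% recall is now obvious
```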
4. Overfitting
- Overfitting: good performance on training data, poor on unseen data.
- Underfitting: poor performance on both training and unseen data.
Overfitting occurs when the model is too complex, the training set is not representative, or learning is performed for too long.
5. Confidence Intervals
A model's true error is the probability that it will misclassify a random sample drawn from the underlying data distribution. We can only measure its sample error: the fraction of a finite test sample that it misclassifies.
Given a sample of $n$ test examples on which the model has sample error $e_S$, and assuming $n$ is reasonably large, an approximate $N\%$ confidence interval for the true error is
$$e_S \pm z_N \sqrt{\frac{e_S(1 - e_S)}{n}}$$
where $z_N$ is the standard normal critical value for the chosen confidence level (e.g. $z_N = 1.96$ for 95%).
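A small sketch of this interval on made-up numbers: 20 misclassifications on a test sample of 100 examples, at 95% confidence.

```python
# Approximate 95% confidence interval for the true error (made-up counts).
import math

n, errors = 100, 20
e_s = errors / n                              # sample error
z = 1.96                                      # z value for a 95% interval
half_width = z * math.sqrt(e_s * (1 - e_s) / n)
print(e_s - half_width, e_s + half_width)     # roughly (0.12, 0.28)
```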
6. Statistical Significance
Statistical tests tell us if the means of two sets are significantly different:
- Randomisation test: Randomly switch predictions between two models, calculate difference in accuracy. Repeat many times to get distribution of differences. See how extreme the actual difference is.
- Two Sample T-test: Estimate likelihood that the two metrics from different populations are actually different.
- Paired T-test: Estimate significance over multiple matched results.
To do this, we define a null hypothesis (the algorithms perform the same, and any observed performance difference is due to chance). The test returns a p-value: the probability of observing a difference at least as extreme as ours, assuming the null hypothesis is true. A small p-value (e.g. $p < 0.05$) means the observed difference would be unlikely under the null hypothesis, so we reject it and treat the difference as statistically significant.
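A sketch of the randomisation test described above, on made-up predictions: swap the two models' predictions per example at random, recompute the accuracy difference, and see how extreme the observed difference is. The swap probability of 0.5 and the 10,000 repetitions are illustrative choices.

```python
# Randomisation (permutation) test on made-up paired predictions.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
preds_a = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # ~80% accurate
preds_b = np.where(rng.random(200) < 0.7, y_true, 1 - y_true)   # ~70% accurate

observed = np.mean(preds_a == y_true) - np.mean(preds_b == y_true)

diffs = []
for _ in range(10_000):
    swap = rng.random(200) < 0.5                # swap each prediction with probability 0.5
    a = np.where(swap, preds_b, preds_a)
    b = np.where(swap, preds_a, preds_b)
    diffs.append(np.mean(a == y_true) - np.mean(b == y_true))

p_value = np.mean(np.abs(diffs) >= abs(observed))   # two-sided p-value
print(observed, p_value)
```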
6.1 P-Hacking
P-hacking is the misuse of data to find patterns that can be presented as statistically significant, when in reality they are not. To combat this, we can:
- Rank the p-values from the $m$ different experiments: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$.
- Calculate the Benjamini-Hochberg critical value for each: $\frac{i}{m} Q$, where $i$ is the p-value's rank and $Q$ is the chosen false discovery rate.
- Significant results are those where $p_{(i)} \leq \frac{i}{m} Q$ (every result up to the largest rank that satisfies this).
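A sketch of the procedure on made-up p-values, with a false discovery rate of $Q = 0.05$ as an illustrative choice:

```python
# Benjamini-Hochberg procedure on made-up p-values.
Q = 0.05
p_values = sorted([0.001, 0.008, 0.028, 0.041, 0.20])
m = len(p_values)

# Find the largest rank i whose p-value is at most its critical value (i/m) * Q;
# every p-value up to that rank is declared significant.
last_significant = 0
for i, p in enumerate(p_values, start=1):
    if p <= (i / m) * Q:
        last_significant = i

print(p_values[:last_significant])   # [0.001, 0.008, 0.028]
```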