Evaluation

1. Hyperparameter Tuning

To ensure a meaningful evaluation, we split the dataset into a training set and a test set. The test set is not given to the model during training, only during evaluation. Both sets are shuffled so that they are unordered. Our objective is to tune the hyperparameters (the settings of the learning algorithm, as opposed to the learned model parameters) to get the best performance.

Do NOT evaluate different hyperparameter choices on the same test set, as this leads to overfitting on the test set (it is effectively being used as a training set). A better approach is to split the data into training, validation and test sets. The validation set is used to evaluate hyperparameter choices, and the test set is only used at the end to get a final performance measure. This process is called hyperparameter tuning.

Once we have done our hyperparameter tuning, we can retrain the model on the combined training and validation sets to get the best model possible. We then evaluate this on the test set.
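
A minimal sketch of this workflow with scikit-learn. The classifier, split ratios and candidate hyperparameter values below are illustrative assumptions, not prescribed by these notes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data

# Hold out a test set first (shuffled by default), then carve a
# validation set out of the remaining data.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune a hyperparameter (here k, the number of neighbours) on the validation set only.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Retrain on training + validation with the chosen hyperparameter,
# then report the final performance on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("chosen k:", best_k, "test accuracy:", final_model.score(X_test, y_test))
```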

2. Cross Validation

When data is limited, extracting three separate datasets may be wasteful. Instead, the dataset is divided into $k$ folds: $k-1$ folds are used for training & validation and the remaining fold for testing. This is repeated $k$ times, each time with a different fold as the test set, and the final performance is averaged across all runs. This is called k-fold cross validation. The global error is estimated as:

$$E = \frac{1}{k} \sum_{i=1}^{k} E_i$$

where $E_i$ is the error measured on the $i$-th held-out fold.

We can't determine the performance of a specific model from this (there are $k$ different models, one per run), but we can evaluate the algorithm used to generate the models.
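
A sketch of k-fold cross validation using scikit-learn's `cross_val_score`; the model and the choice of $k=5$ are arbitrary and only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each of the k = 5 runs trains on 4 folds and evaluates on the remaining fold.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# The global estimate is the average over the k held-out folds.
print("fold accuracies:", scores)
print("estimated accuracy:", scores.mean())
```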

2.1 Parameter Tuning

Option 1:

Option 2:

Once everything is chosen and evaluated, we can combine all datasets into one and retrain the model with the optimal hyperparameters. However, we cannot be sure this is the best model as we have not tested it on unseen data.
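
One common way to implement this (a sketch only, not necessarily the exact options listed above) is to hold out a test set, let a grid search run k-fold cross validation over the hyperparameter grid on the rest, and then refit on all of the non-test data. The estimator and grid below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross validation over a small (illustrative) hyperparameter grid.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_rest, y_rest)   # refit=True (the default) retrains on all of X_rest

print("best hyperparameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```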

3. Evaluation Metrics

A confusion matrix is a table that summarises the performance of a classification model. For binary classification:

|                 | Predicted Positive  | Predicted Negative  |
| --------------- | ------------------- | ------------------- |
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Confusion Matrix Example
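
A small sketch producing this matrix with scikit-learn. The labels below are made up; note that `confusion_matrix` orders rows and columns by label value unless told otherwise:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# labels=[1, 0] puts the positive class first, matching the table above:
# rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```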

3.1 Accuracy

Accuracy is given by $\frac{TP + TN}{TP + TN + FP + FN}$. It is the proportion of correct predictions. The classification error is $1 - \text{accuracy} = \frac{FP + FN}{TP + TN + FP + FN}$.

3.2 Precision

Precision is given by $\frac{TP}{TP + FP}$. It is the proportion of positive predictions that are correct.

3.3 Recall

Recall is given by $\frac{TP}{TP + FN}$. It is the proportion of actual positives that are correctly identified.

Usually, there is a trade-off between precision and recall (either positive examples are correctly recognised with many false positives, or we miss many positive examples but have few false positives).
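
Computing the three metrics directly from confusion-matrix counts; the counts below are hypothetical (they match the earlier sketch):

```python
# Hypothetical confusion-matrix counts.
tp, fn, fp, tn = 3, 1, 1, 3

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # proportion of correct predictions
error     = 1 - accuracy                      # classification error
precision = tp / (tp + fp)                    # how many positive predictions were correct
recall    = tp / (tp + fn)                    # how many actual positives were recovered

print(f"accuracy={accuracy:.2f} error={error:.2f} "
      f"precision={precision:.2f} recall={recall:.2f}")
```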

3.4 Macro / Micro Averaging
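
In brief: macro averaging computes the metric separately for each class and takes the unweighted mean, so every class counts equally; micro averaging pools the TP/FP/FN counts over all classes first, so frequent classes dominate the result. A sketch with scikit-learn on a made-up multi-class example:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 0]

# Macro: average of the per-class precisions (classes weighted equally).
# Micro: precision computed from the pooled TP/FP counts of all classes.
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("micro precision:", precision_score(y_true, y_pred, average="micro"))
```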

3.5 F-Measure / F-Score

This combines precision and recall into a single metric:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

$\beta$ controls the weight of recall vs precision: if $\beta > 1$, recall is weighted more; if $\beta < 1$, precision is weighted more.
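
A quick sketch with scikit-learn's `fbeta_score`; the labels are made up, and $\beta = 1$ gives the usual F1 score:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print("F1          :", fbeta_score(y_true, y_pred, beta=1))    # balanced
print("F2 (recall) :", fbeta_score(y_true, y_pred, beta=2))    # recall weighted more
print("F0.5 (prec.):", fbeta_score(y_true, y_pred, beta=0.5))  # precision weighted more
```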

3.6 Multiple Classes

We define one class as being positive and the rest negative. A confusion matrix for multiple classes looks like:

|           | Predicted C1 | Predicted C2 | Predicted C3 | Predicted C4 |
| --------- | ------------ | ------------ | ------------ | ------------ |
| Actual C1 | TP           | FN           | FN           | FN           |
| Actual C2 | FP           | TN           | TN           | TN           |
| Actual C3 | FP           | TN           | TN           | TN           |
| Actual C4 | FP           | TN           | TN           | TN           |

Here C1 is treated as the positive class, so every cell outside row C1 and column C1 counts as a true negative for C1.

3.7 Mean Squared Error

MSE is used for regression tasks. It is defined as

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the true value and $\hat{y}_i$ is the predicted value. It measures the average squared difference between the predicted and actual values. Lower MSE indicates better model performance.

Root mean squared error is defined as $\text{RMSE} = \sqrt{\text{MSE}}$. It has the same units as the target variable, making it more interpretable.
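
A direct NumPy sketch of both formulas on made-up regression values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # actual target values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

mse = np.mean((y_true - y_pred) ** 2)      # average squared difference
rmse = np.sqrt(mse)                        # back in the units of the target
print("MSE:", mse, "RMSE:", rmse)
```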

3.8 Picking Metrics

We want our models to be accurate, fast, scalable, simple and interpretable. It's not just about accuracy!

3.9 Imbalanced Data Distribution

If our data is imbalanced between classes, overall accuracy can be misleading: a model that always predicts the majority class scores highly while never identifying the minority class.

To fix this, we could normalise the rows in the confusion matrix. Alternatively, we could upsample or downsample the data to balance the classes.
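
A sketch of upsampling the minority class with scikit-learn's `resample` utility (downsampling the majority class works symmetrically; the toy features and labels are invented):

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)                  # toy features
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])      # imbalanced labels: 8 vs 2

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Sample the minority class with replacement until it matches the majority.
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=len(y_maj), random_state=0)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])
print("class counts after upsampling:", np.bincount(y_bal))
```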

4. Overfitting

Overfitting occurs when the model is too complex, the training set is not representative, or learning is performed for too long.

5. Confidence Intervals

A model's true error is the probability that it will misclassify a random sample drawn from the data distribution $\mathcal{D}$:

$$\text{error}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[f(x) \neq h(x)]$$

The sample error is computed on a data sample $S$ of size $n$:

$$\text{error}_S(h) = \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$$

where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.

Given a sample $S$ with $n \geq 30$, we can estimate $\text{error}_{\mathcal{D}}(h)$ with an $N\%$ confidence interval:

$$\text{error}_S(h) \pm z_N \sqrt{\frac{\text{error}_S(h)\,(1 - \text{error}_S(h))}{n}}$$

where $z_N$ is the corresponding critical value of the normal distribution (e.g. $z_N = 1.96$ for 95% confidence).
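
A numeric sketch of this interval, assuming (hypothetically) 20 misclassifications over $n = 200$ test samples and a 95% interval ($z_N \approx 1.96$):

```python
import math

n = 200                     # hypothetical test-sample size
errors = 20                 # hypothetical number of misclassifications
error_s = errors / n        # sample error
z = 1.96                    # critical value for a 95% confidence interval

half_width = z * math.sqrt(error_s * (1 - error_s) / n)
print(f"true error is in [{error_s - half_width:.3f}, {error_s + half_width:.3f}] "
      f"with ~95% confidence")
```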

6. Statistical Significance

Statistical tests (for example, a t-test) tell us whether the means of two sets of results are significantly different.

To do this, we define a null hypothesis (the algorithms perform the same and any performance differences are statistically insignificant). The test returns a p-value: the probability of observing results at least as extreme as ours, assuming the null hypothesis is true. A small p-value (e.g. $p < 0.05$) indicates strong evidence against the null hypothesis, so we reject it.
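
A sketch using a paired t-test over per-fold scores; this is one common choice when both algorithms were evaluated on the same cross-validation folds, and the score arrays below are invented:

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies of two algorithms on the same folds.
scores_a = [0.81, 0.79, 0.84, 0.80, 0.83]
scores_b = [0.78, 0.75, 0.80, 0.77, 0.79]

stat, p_value = ttest_rel(scores_a, scores_b)
print("p-value:", p_value)
if p_value < 0.05:
    print("reject the null hypothesis: the difference is significant")
else:
    print("cannot reject the null hypothesis")
```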

6.1 P-Hacking

P-hacking is the misuse of data to find patterns that can be presented as statistically significant, when in reality they are not. To combat this, we can:

  1. Rank the p-values from the different experiments in ascending order: $p_1 \leq p_2 \leq \dots \leq p_m$.
  2. Calculate the Benjamini-Hochberg critical value for each: $\frac{i}{m} Q$, where $i$ is the rank, $m$ is the number of tests and $Q$ is the chosen false discovery rate.
  3. Significant results are those with $p_i \leq \frac{i}{m} Q$, up to the largest rank that satisfies this.
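
A sketch of the procedure in NumPy; the p-values and the false discovery rate $Q = 0.05$ are made up:

```python
import numpy as np

p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
Q = 0.05                                  # chosen false discovery rate
m = len(p_values)

order = np.argsort(p_values)              # ranks 1..m after sorting
critical = (np.arange(1, m + 1) / m) * Q  # Benjamini-Hochberg critical values

below = p_values[order] <= critical
cutoff = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0

significant = np.zeros(m, dtype=bool)
significant[order[:cutoff]] = True        # everything up to the largest passing rank
print("significant tests:", significant)
```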