K-NNs & Decision Trees
1. K-NNs
Instance-based learning is a type of lazy learning where the model simply stores the training data and makes predictions based on similarity to training examples. The model is only built when a prediction is required.
A nearest neighbour classifier assigns an instance the class label of the nearest training instance according to some distance metric. This is a 1-NN. It is a non-parametric model: the decision boundary is not predefined but emerges from the training data. A 1-NN is sensitive to noise and prone to overfitting, so we scale up to a K-NN, where we select the most popular label amongst the K nearest neighbours. K is usually set to an odd number so that ties are avoided (a unique majority exists in the binary case). Increasing K:
- Gives a smoother decision boundary (higher bias!)
- Makes the classifier less sensitive to the training data (lower variance!)
To use a K-NN we need to define a distance metric (a code sketch of these follows this list):
- Manhattan distance ($\ell_1$-norm): $d(\mathbf{x}, \mathbf{y}) = \sum_{i} |x_i - y_i|$
- Euclidean distance ($\ell_2$-norm): $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i} (x_i - y_i)^2}$
- Chebyshev distance ($\ell_\infty$-norm): $d(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$
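As a concrete illustration, here is a minimal NumPy sketch of the three metrics and a plain (unweighted) K-NN classifier built on top of them; the function and variable names are my own, not from the notes.

```python
import numpy as np
from collections import Counter

def manhattan(x, y):
    """l1-norm: sum of absolute coordinate differences."""
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    """l2-norm: square root of the sum of squared differences."""
    return np.sqrt(np.sum((x - y) ** 2))

def chebyshev(x, y):
    """l-infinity-norm: largest absolute coordinate difference."""
    return np.max(np.abs(x - y))

def knn_predict(X_train, y_train, x_query, k=3, dist=euclidean):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = [dist(x, x_query) for x in X_train]
    nearest = np.argsort(distances)[:k]           # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # count labels among the neighbours
    return votes.most_common(1)[0][0]             # most popular label wins
```

For example, `knn_predict(X_train, y_train, x_query, k=5, dist=manhattan)` returns the majority label of the five nearest training points under the Manhattan distance.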
1.1 Distance weighted K-NN
This variant weights the importance of the K neighbours by their distance to the query point $\mathbf{x}_q$: each neighbour $\mathbf{x}_i$ is given a weight $w_i$, and the class with the largest total weight is predicted. Common choices for $w_i$ are:
- Inverse of distance: $w_i = \frac{1}{d(\mathbf{x}_q, \mathbf{x}_i)}$
- Gaussian of distance: $w_i = \exp\!\left(-\frac{d(\mathbf{x}_q, \mathbf{x}_i)^2}{2\sigma^2}\right)$
When K-NNs are distance weighted, increasing K has less of an effect on classification, which is good; when K is as large as the training set, every training instance contributes to the prediction.
Distance-weighted K-NNs are more robust to noisy training data. However, K-NNs suffer from the curse of dimensionality (higher dimensions require much larger K and distance metrics become harder to compute and less meaningful). K-NNs also do not filter out irrelevant features.
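A minimal sketch of distance-weighted classification with both weighting schemes; the small `eps` is my own addition to avoid division by zero when the query coincides with a training point.

```python
import numpy as np
from collections import defaultdict

def inverse_weight(d, eps=1e-9):
    """w_i = 1 / d(x_q, x_i); eps avoids division by zero for exact matches."""
    return 1.0 / (d + eps)

def gaussian_weight(d, sigma=1.0):
    """w_i = exp(-d^2 / (2 * sigma^2))."""
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def weighted_knn_predict(X_train, y_train, x_query, k=5, weight=inverse_weight):
    """Classify x_query as the class with the largest total neighbour weight.
    X_train is an (n, d) array, y_train a length-n sequence of labels."""
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean
    nearest = np.argsort(distances)[:k]
    totals = defaultdict(float)
    for i in nearest:
        totals[y_train[i]] += weight(distances[i])  # closer neighbours count more
    return max(totals, key=totals.get)
```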
1.2 K-NNs for Regression
We can perform regression by predicting the mean target value of the K nearest neighbours. We can also do a locally weighted regression, weighting each neighbour's contribution by its distance.
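In the same style, a sketch of K-NN regression with an optional locally weighted (inverse-distance) variant; names are illustrative.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, weighted=False, eps=1e-9):
    """Predict a real value from the k nearest neighbours' target values.
    X_train is an (n, d) array, y_train a length-n array of real targets."""
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean
    nearest = np.argsort(distances)[:k]
    if not weighted:
        return np.mean(y_train[nearest])             # plain K-NN regression
    w = 1.0 / (distances[nearest] + eps)             # inverse-distance weights
    return np.sum(w * y_train[nearest]) / np.sum(w)  # locally weighted average
```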
2. Decision Trees
A decision tree is an eager learner that builds a succession of linear decision boundaries to classify a dataset. It learns with a top-down greedy search through the solution space (sketched in code after this list):
1. Search for an optimal splitting rule.
2. Split the dataset according to the rule.
3. Repeat steps 1 and 2 on each new subset.
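A compact sketch of this top-down greedy loop on a dict-based tree. The splitting criterion is left abstract: `find_best_rule` is a hypothetical helper standing in for whichever metric is used (information gain, Gini impurity, ...), and the node field names are illustrative choices of my own.

```python
from collections import Counter

def build_tree(X, y, find_best_rule, depth=0, max_depth=5):
    """Top-down greedy construction: pick the best rule, split, recurse.

    find_best_rule(X, y) is a hypothetical helper returning (feature_index,
    threshold) for the best split, or None if no useful split exists
    (e.g. the subset is already pure)."""
    majority = Counter(y).most_common(1)[0][0]
    rule = find_best_rule(X, y) if depth < max_depth else None
    if rule is None:
        return {"leaf": True, "label": majority}      # stop: predict the majority class
    feature, threshold = rule
    left = [i for i, x in enumerate(X) if x[feature] <= threshold]
    right = [i for i, x in enumerate(X) if x[feature] > threshold]
    if not left or not right:
        return {"leaf": True, "label": majority}      # degenerate split: stop
    return {
        "leaf": False,
        "feature": feature,
        "threshold": threshold,
        "majority": majority,                         # kept around for pruning later
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           find_best_rule, depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            find_best_rule, depth + 1, max_depth),
    }
```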
We want partitioned subsets that are more pure (more uniform in class label) than the original dataset. We can select this rule with different metrics, which are used by different algorithms:
- ID3 and C4.5 use information gain, which quantifies the reduction in information entropy.
- CART uses Gini impurity: the probability that a randomly picked point would be labelled incorrectly if labels were assigned according to the subset's class distribution (see the sketch after this list).
- CART also uses variance reduction (for regression).
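For concreteness, a minimal sketch of the Gini impurity ($G = 1 - \sum_k p_k^2$); the function name is my own.

```python
from collections import Counter

def gini_impurity(labels):
    """G = 1 - sum_k p_k^2: the probability that a randomly picked example
    would be misclassified if it were labelled at random according to the
    subset's class distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["A", "A", "B", "B"]))  # 0.5 (maximally impure for two classes)
print(gini_impurity(["A", "A", "A", "A"]))  # 0.0 (pure subset)
```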
2.1 Information Entropy
Entropy is a measure of the uncertainty of a random variable. It is the expected amount of information required to fully define the variable's state. Low-entropy variables are predictable; high-entropy variables are not.
We can measure the amount of information as the number of yes/no questions we need to ask to figure out the state of a random variable. In general, we need $\log_2 \frac{1}{p(x)} = -\log_2 p(x)$ bits to identify a state $x$ that occurs with probability $p(x)$.
Example 2.1
If there are 4 boxes and the chance that the item is in each box is $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$ and $\frac{1}{8}$, then:
$I\!\left(\tfrac{1}{2}\right) = \log_2 2 = 1$ bit. $I\!\left(\tfrac{1}{8}\right) = \log_2 8 = 3$ bits.
$\frac{1}{2}$ of the time, not much information is needed, but $\frac{1}{8}$ of the time, lots is needed. The average is:
$H = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75$ bits.
For a random variable $X$ with probability distribution $p(x)$, the entropy is:
$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$
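A short sketch of this formula, using the box probabilities from Example 2.1 as a check.

```python
import math

def entropy(probabilities):
    """H(X) = -sum_x p(x) * log2 p(x); states with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits (Example 2.1)
print(entropy([0.25] * 4))                 # 2.0 bits for a uniform distribution
```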
2.2 Selecting Splitting Rules
For each candidate rule $A$, we calculate the information gain, defined as follows, where $S$ is the dataset being split, $H$ is the entropy, and $S_v$ is the subset of $S$ for which the rule takes value $v$:
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
For a binary tree it is easier:
$$IG(S, A) = H(S) - \frac{|S_L|}{|S|}\, H(S_L) - \frac{|S_R|}{|S|}\, H(S_R)$$
We select the split rule that maximises the information gain.
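A sketch of the binary-split information gain computed from label lists; names are illustrative, and the entropy helper is redefined so the snippet stands alone.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy of the empirical class distribution of a set of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, left_labels, right_labels):
    """IG = H(S) - |S_L|/|S| * H(S_L) - |S_R|/|S| * H(S_R)."""
    n = len(parent_labels)
    return (entropy_of_labels(parent_labels)
            - len(left_labels) / n * entropy_of_labels(left_labels)
            - len(right_labels) / n * entropy_of_labels(right_labels))

# A perfect split of a 50/50 parent gains a full bit of information
print(information_gain(["A", "A", "B", "B"], ["A", "A"], ["B", "B"]))  # 1.0
```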
2.3 Decision Tree Inputs
- Ordered values are continuous values that can be sorted. We can split them with a threshold rule $x_j \le t$ or $x_j > t$.
- Categorical / symbolic values are discrete values that cannot be ordered. We can split them with a rule $x_j \in A$ or $x_j \notin A$, where $A$ is a subset of the possible values. (Candidate rules of both kinds are enumerated in the sketch below.)
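A small sketch of how candidate rules can be enumerated for the two input types: midpoints between consecutive sorted values for ordered features, and subsets of observed values for categorical features. This enumeration strategy is a common choice, not something prescribed by the notes.

```python
from itertools import combinations

def ordered_thresholds(values):
    """Candidate thresholds t for rules x <= t: midpoints between consecutive
    distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def categorical_subsets(values):
    """Candidate subsets A for rules x in A: every non-empty proper subset of
    the observed values (only feasible for a small number of categories)."""
    distinct = sorted(set(values))
    subsets = []
    for r in range(1, len(distinct)):
        subsets.extend(set(c) for c in combinations(distinct, r))
    return subsets

print(ordered_thresholds([1.0, 3.0, 2.0, 3.0]))       # [1.5, 2.5]
print(categorical_subsets(["red", "green", "blue"]))  # 6 proper subsets
```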
2.4 Overfitting
Like many machine learning algorithms, decision trees can overfit: the model is fitted so closely to the training data that it picks up noise (high variance). Underfitting is when the model misses the information in the data (high bias). To deal with overfitting we can:
- Stop early: e.g. enforce a maximum depth or a minimum number of samples remaining in a subset.
- Prune: simplify the tree, using a validation dataset to test whether it has improved.
Pruning Instructions
- Go through each internal node whose children are all leaf nodes (the last decision before a prediction).
- Turn it into a leaf node labelled with the majority class.
- Evaluate the pruned tree on the validation set; keep the prune if its accuracy is higher than the unpruned tree's.
- Repeat until all such nodes have been tested (a code sketch of this procedure follows).
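A sketch of that pruning loop over the dict-based tree format from the earlier build sketch, so `leaf`, `label`, `majority`, `left` and `right` are my illustrative field names rather than anything from the notes.

```python
def predict(node, x):
    """Walk the tree until a leaf is reached."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def accuracy(tree, X_val, y_val):
    return sum(predict(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def prune(tree, X_val, y_val):
    """Reduced-error pruning: repeatedly turn 'last decision' nodes into leaves
    whenever doing so improves validation accuracy."""
    def candidates(node):
        # internal nodes whose children are both leaves
        if node["leaf"]:
            return []
        if node["left"]["leaf"] and node["right"]["leaf"]:
            return [node]
        return candidates(node["left"]) + candidates(node["right"])

    improved = True
    while improved:
        improved = False
        baseline = accuracy(tree, X_val, y_val)
        for node in candidates(tree):
            saved = dict(node)                                       # remember the split
            node.clear()
            node.update({"leaf": True, "label": saved["majority"]})  # try pruning
            if accuracy(tree, X_val, y_val) > baseline:
                improved = True                                      # keep the prune
                break
            node.clear()
            node.update(saved)                                       # otherwise restore
    return tree
```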
2.5 Random Forests
Instead of having one decision tree, we have many decision trees voting on the class label, each tree grown from a random sample of the training set (bagging) and considering a random subset of the features.
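In practice this is typically done with a library. A minimal scikit-learn example on a synthetic dataset follows; the parameter choices are illustrative, and note that scikit-learn draws the random feature subset at each split rather than once per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic dataset purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample (bagging), with a random
# subset of sqrt(n_features) features considered at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the trees' majority vote
```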
2.6 Regression Trees
Decision trees can also be used for regression. Instead of a class label, each leaf node predicts a real number, estimated from the training examples that reach the leaf (e.g. their mean output value). A different splitting metric is used: variance reduction.
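A sketch of the variance-reduction splitting metric, analogous to information gain but with variance in place of entropy; the function names are my own.

```python
def variance(values):
    """Variance of the target values in a subset."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """How much a split reduces the size-weighted variance of the targets.
    The split with the largest reduction is chosen; leaves predict the mean."""
    n = len(parent)
    return (variance(parent)
            - len(left) / n * variance(left)
            - len(right) / n * variance(right))

print(variance_reduction([1.0, 1.0, 5.0, 5.0], [1.0, 1.0], [5.0, 5.0]))  # 4.0
```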