K-NNs & Decision Trees
1. K-NNs
Instance-based learning is a type of lazy learning where the model simply stores the training data and makes predictions based on similarity to training examples. The model is only built when a prediction is required.
A nearest neighbour classifier assigns an instance the class label of the nearest training instance according to some distance metric. This is a 1-NN. It is a non-parametric model: the decision boundary is not predefined but emerges from the training data. A 1-NN is sensitive to noise and prone to overfitting, so we scale up to a K-NN, where we select the most popular label amongst the K nearest neighbours. K is usually set to an odd number so that ties are avoided (a unique majority exists in the binary case). Increasing K:
- Gives a smoother decision boundary (higher bias!)
- Makes the classifier less sensitive to the training data (lower variance!)
To use a K-NN we need to define a distance metric (a code sketch of these follows this list):
- Manhattan distance ($\ell_1$-norm): $d(\mathbf{x}, \mathbf{y}) = \sum_{i} |x_i - y_i|$
- Euclidean distance ($\ell_2$-norm): $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i} (x_i - y_i)^2}$
- Chebyshev distance ($\ell_\infty$-norm): $d(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$
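As a concrete illustration, here is a minimal NumPy sketch of the three metrics and a plain (unweighted) K-NN classifier built on top of them; the function and variable names are my own, not from the notes.

```python
import numpy as np
from collections import Counter

def manhattan(x, y):
    """l1-norm: sum of absolute coordinate differences."""
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    """l2-norm: square root of the sum of squared differences."""
    return np.sqrt(np.sum((x - y) ** 2))

def chebyshev(x, y):
    """l-infinity-norm: largest absolute coordinate difference."""
    return np.max(np.abs(x - y))

def knn_predict(X_train, y_train, x_query, k=3, dist=euclidean):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = [dist(x, x_query) for x in X_train]
    nearest = np.argsort(distances)[:k]           # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # count labels among the neighbours
    return votes.most_common(1)[0][0]             # most popular label wins
```

For example, `knn_predict(X_train, y_train, x_query, k=5, dist=manhattan)` returns the majority label of the five nearest training points under the Manhattan distance.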
1.1 Distance weighted K-NN
This variant weights the importance of the K neighbours by their distance to the query point $\mathbf{x}_q$: each neighbour $\mathbf{x}_i$ is given a weight $w_i$, and the class with the largest total weight is predicted. Common choices for $w_i$ are:
- Inverse of distance: $w_i = \frac{1}{d(\mathbf{x}_q, \mathbf{x}_i)}$
- Gaussian of distance: $w_i = \exp\!\left(-\frac{d(\mathbf{x}_q, \mathbf{x}_i)^2}{2\sigma^2}\right)$
When K-NNs are distance weighted, increasing K has less of an effect on classification, which is good; when K is as large as the training set, every training instance contributes to the prediction.
Distance-weighted K-NNs are more robust to noisy training data. However, K-NNs suffer from the curse of dimensionality (higher dimensions require much larger K and distance metrics become harder to compute and less meaningful). K-NNs also do not filter out irrelevant features.
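A minimal sketch of distance-weighted classification with both weighting schemes; the small `eps` is my own addition to avoid division by zero when the query coincides with a training point.

```python
import numpy as np
from collections import defaultdict

def inverse_weight(d, eps=1e-9):
    """w_i = 1 / d(x_q, x_i); eps avoids division by zero for exact matches."""
    return 1.0 / (d + eps)

def gaussian_weight(d, sigma=1.0):
    """w_i = exp(-d^2 / (2 * sigma^2))."""
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def weighted_knn_predict(X_train, y_train, x_query, k=5, weight=inverse_weight):
    """Classify x_query as the class with the largest total neighbour weight.
    X_train is an (n, d) array, y_train a length-n sequence of labels."""
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean
    nearest = np.argsort(distances)[:k]
    totals = defaultdict(float)
    for i in nearest:
        totals[y_train[i]] += weight(distances[i])  # closer neighbours count more
    return max(totals, key=totals.get)
```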
1.2 K-NNs for Regression
We can perform regression by predicting the mean target value of the K nearest neighbours. We can also do a locally weighted regression, weighting each neighbour's contribution by its distance.
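In the same style, a sketch of K-NN regression with an optional locally weighted (inverse-distance) variant; names are illustrative.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, weighted=False, eps=1e-9):
    """Predict a real value from the k nearest neighbours' target values.
    X_train is an (n, d) array, y_train a length-n array of real targets."""
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean
    nearest = np.argsort(distances)[:k]
    if not weighted:
        return np.mean(y_train[nearest])             # plain K-NN regression
    w = 1.0 / (distances[nearest] + eps)             # inverse-distance weights
    return np.sum(w * y_train[nearest]) / np.sum(w)  # locally weighted average
```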
2. Decision Trees
A decision tree is an eager learner that builds a succession of linear decision boundaries to classify a dataset. It learns with a top-down greedy search through the solution space (sketched in code after this list):
1. Search for an optimal splitting rule.
2. Split the dataset according to the rule.
3. Repeat steps 1 and 2 on each new subset.
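A compact sketch of this top-down greedy loop on a dict-based tree. The splitting criterion is left abstract: `find_best_rule` is a hypothetical helper standing in for whichever metric is used (information gain, Gini impurity, ...), and the node field names are illustrative choices of my own.

```python
from collections import Counter

def build_tree(X, y, find_best_rule, depth=0, max_depth=5):
    """Top-down greedy construction: pick the best rule, split, recurse.

    find_best_rule(X, y) is a hypothetical helper returning (feature_index,
    threshold) for the best split, or None if no useful split exists
    (e.g. the subset is already pure)."""
    majority = Counter(y).most_common(1)[0][0]
    rule = find_best_rule(X, y) if depth < max_depth else None
    if rule is None:
        return {"leaf": True, "label": majority}      # stop: predict the majority class
    feature, threshold = rule
    left = [i for i, x in enumerate(X) if x[feature] <= threshold]
    right = [i for i, x in enumerate(X) if x[feature] > threshold]
    if not left or not right:
        return {"leaf": True, "label": majority}      # degenerate split: stop
    return {
        "leaf": False,
        "feature": feature,
        "threshold": threshold,
        "majority": majority,                         # kept around for pruning later
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           find_best_rule, depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            find_best_rule, depth + 1, max_depth),
    }
```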
We want partitioned subsets that are more pure (more uniform in class label) than the original dataset. We can select this rule with different metrics, which are used by different algorithms:
- ID3 and C4.5 use information gain, which quantifies the reduction in information entropy.
- CART uses Gini impurity: the probability that a randomly picked point would be labelled incorrectly if labels were assigned according to the subset's class distribution (see the sketch after this list).
- CART also uses variance reduction (for regression).
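For concreteness, a minimal sketch of the Gini impurity ($G = 1 - \sum_k p_k^2$); the function name is my own.

```python
from collections import Counter

def gini_impurity(labels):
    """G = 1 - sum_k p_k^2: the probability that a randomly picked example
    would be misclassified if it were labelled at random according to the
    subset's class distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["A", "A", "B", "B"]))  # 0.5 (maximally impure for two classes)
print(gini_impurity(["A", "A", "A", "A"]))  # 0.0 (pure subset)
```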
2.1 Information Entropy
Entropy is a measure of the uncertainty of a random variable. It is the expected amount of information required to fully define the variable's state. Low-entropy variables are predictable; high-entropy variables are not.
We can measure the amount of information as the number of yes/no questions we need to ask to figure out the state of a random variable. In general, we need $\log_2 \frac{1}{p(x)} = -\log_2 p(x)$ bits to identify a state $x$ that occurs with probability $p(x)$.
Example 2.1
If there are 4 boxes and the chance that the item is in each box is $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$ and $\frac{1}{8}$, then:
$I\!\left(\tfrac{1}{2}\right) = \log_2 2 = 1$ bit. $I\!\left(\tfrac{1}{8}\right) = \log_2 8 = 3$ bits.
$\frac{1}{2}$ of the time, not much information is needed, but $\frac{1}{8}$ of the time, lots is needed. The average is:
$H = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75$ bits.
For a random variable $X$ with probability distribution $p(x)$, the entropy is:
$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$
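A short sketch of this formula, using the box probabilities from Example 2.1 as a check.

```python
import math

def entropy(probabilities):
    """H(X) = -sum_x p(x) * log2 p(x); states with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits (Example 2.1)
print(entropy([0.25] * 4))                 # 2.0 bits for a uniform distribution
```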
2.2 Selecting Splitting Rules
For each candidate rule $A$, we calculate the information gain, defined as follows, where $S$ is the dataset being split, $H$ is the entropy, and $S_v$ is the subset of $S$ for which the rule takes value $v$:
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
For a binary tree it is easier:
$$IG(S, A) = H(S) - \frac{|S_L|}{|S|}\, H(S_L) - \frac{|S_R|}{|S|}\, H(S_R)$$
We select the split rule that maximises the information gain.
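A sketch of the binary-split information gain computed from label lists; names are illustrative, and the entropy helper is redefined so the snippet stands alone.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy of the empirical class distribution of a set of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, left_labels, right_labels):
    """IG = H(S) - |S_L|/|S| * H(S_L) - |S_R|/|S| * H(S_R)."""
    n = len(parent_labels)
    return (entropy_of_labels(parent_labels)
            - len(left_labels) / n * entropy_of_labels(left_labels)
            - len(right_labels) / n * entropy_of_labels(right_labels))

# A perfect split of a 50/50 parent gains a full bit of information
print(information_gain(["A", "A", "B", "B"], ["A", "A"], ["B", "B"]))  # 1.0
```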
2.3 Decision Tree Inputs
- Ordered values are continuous values that can be sorted. We can split them with a threshold rule $x_j \le t$ or $x_j > t$.
- Categorical / symbolic values are discrete values that cannot be ordered. We can split them with a rule $x_j \in A$ or $x_j \notin A$, where $A$ is a subset of the possible values. (Candidate rules of both kinds are enumerated in the sketch below.)
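A small sketch of how candidate rules can be enumerated for the two input types: midpoints between consecutive sorted values for ordered features, and subsets of observed values for categorical features. This enumeration strategy is a common choice, not something prescribed by the notes.

```python
from itertools import combinations

def ordered_thresholds(values):
    """Candidate thresholds t for rules x <= t: midpoints between consecutive
    distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def categorical_subsets(values):
    """Candidate subsets A for rules x in A: every non-empty proper subset of
    the observed values (only feasible for a small number of categories)."""
    distinct = sorted(set(values))
    subsets = []
    for r in range(1, len(distinct)):
        subsets.extend(set(c) for c in combinations(distinct, r))
    return subsets

print(ordered_thresholds([1.0, 3.0, 2.0, 3.0]))       # [1.5, 2.5]
print(categorical_subsets(["red", "green", "blue"]))  # 6 proper subsets
```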
2.4 Overfitting
Like many machine learning algorithms, decision trees can overfit: the model is fitted so closely to the training data that it picks up noise (high variance). Underfitting is when the model misses the information in the data (high bias). To deal with overfitting we can:
- Stop early: e.g. enforce a maximum depth or a minimum number of samples remaining in a subset.
- Prune: simplify the tree, using a validation dataset to test whether it has improved.
Pruning Instructions
- Go through each internal node whose children are all leaf nodes (the last decision before a prediction).
- Turn it into a leaf node labelled with the majority class.
- Evaluate the pruned tree on the validation set; keep the prune if its accuracy is higher than the unpruned tree's.
- Repeat until all such nodes have been tested (a code sketch of this procedure follows).
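A sketch of that pruning loop over the dict-based tree format from the earlier build sketch, so `leaf`, `label`, `majority`, `left` and `right` are my illustrative field names rather than anything from the notes.

```python
def predict(node, x):
    """Walk the tree until a leaf is reached."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def accuracy(tree, X_val, y_val):
    return sum(predict(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def prune(tree, X_val, y_val):
    """Reduced-error pruning: repeatedly turn 'last decision' nodes into leaves
    whenever doing so improves validation accuracy."""
    def candidates(node):
        # internal nodes whose children are both leaves
        if node["leaf"]:
            return []
        if node["left"]["leaf"] and node["right"]["leaf"]:
            return [node]
        return candidates(node["left"]) + candidates(node["right"])

    improved = True
    while improved:
        improved = False
        baseline = accuracy(tree, X_val, y_val)
        for node in candidates(tree):
            saved = dict(node)                                       # remember the split
            node.clear()
            node.update({"leaf": True, "label": saved["majority"]})  # try pruning
            if accuracy(tree, X_val, y_val) > baseline:
                improved = True                                      # keep the prune
                break
            node.clear()
            node.update(saved)                                       # otherwise restore
    return tree
```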
2.5 Random Forests
Instead of having one decision tree, we have many decision trees voting on the class label, each tree grown from a random sample of the training set (bagging) and considering a random subset of the features.
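In practice this is typically done with a library. A minimal scikit-learn example on a synthetic dataset follows; the parameter choices are illustrative, and note that scikit-learn draws the random feature subset at each split rather than once per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic dataset purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample (bagging), with a random
# subset of sqrt(n_features) features considered at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the trees' majority vote
```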
2.6 Regression Trees
Decision trees can also be used for regression. Instead of a class label, each leaf node predicts a real number, estimated from the training examples that reach the leaf (e.g. their mean output value). A different splitting metric is used: variance reduction.
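A sketch of the variance-reduction splitting metric, analogous to information gain but with variance in place of entropy; the function names are my own.

```python
def variance(values):
    """Variance of the target values in a subset."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """How much a split reduces the size-weighted variance of the targets.
    The split with the largest reduction is chosen; leaves predict the mean."""
    n = len(parent)
    return (variance(parent)
            - len(left) / n * variance(left)
            - len(right) / n * variance(right))

print(variance_reduction([1.0, 1.0, 5.0, 5.0], [1.0, 1.0], [5.0, 5.0]))  # 4.0
```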