Artificial Neural Networks are a class of machine learning algorithms. Their architecture consists of connected neurons whose parameters are optimised by gradient descent. Deep learning refers to neural network models with multiple hidden layers.
2. Linear Regression
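Linear regression is a type of supervised learning, where a dataset $\{(x_i, y_i)\}_{i=1}^{N}$ consisting of inputs $x_i$ and outputs $y_i$ is used to learn a function $f$ such that $f(x_i) \approx y_i$.
A linear regression model assumes that $f$ is a linear function of the inputs: $\hat{y} = ax + b$. The model is trained by minimising a loss function between the predicted outputs $\hat{y}_i$ and the true outputs $y_i$. Our sum of squares loss function is defined as:
$$E = \frac{1}{2} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
A good loss function is easily differentiable. To minimise the loss function, we use gradient descent. To do this, we update the parameters using their partial derivatives:
$$\frac{\partial E}{\partial a} = \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial E}{\partial b} = \sum_{i=1}^{N} (\hat{y}_i - y_i)$$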
2.1 Gradient Descent
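Gradient descent repeatedly updates the parameters $a$, $b$ by taking small steps in the negative direction of the partial derivatives: $a := a - \alpha \frac{\partial E}{\partial a}$, $b := b - \alpha \frac{\partial E}{\partial b}$, where $\alpha$ is the learning rate (or step size). For example: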
    import numpy as np

    # Assume X and Y are given data points (1-D NumPy arrays of equal length)

    a, b = 0.0, 0.0        # Initial parameters
    learning_rate = 1e-11  # Hyperparameter
    num_epochs = 5         # Hyperparameter

    for epoch in range(num_epochs):
        update_a = 0.0
        update_b = 0.0
        error = 0.0
        for i in range(len(Y)):
            y_pred = a * X[i] + b
            update_a += (y_pred - Y[i]) * X[i]  # Accumulate dE/da
            update_b += (y_pred - Y[i])         # Accumulate dE/db
            error += np.square(y_pred - Y[i])
        a -= learning_rate * update_a
        b -= learning_rate * update_b
        rmse = np.sqrt(error / len(Y))
        print(f"Epoch {epoch+1}: a={a}, b={b}, RMSE={rmse}")
We can do this much faster with vector operations:
    import numpy as np

    # Assume X and Y are given data points (1-D NumPy arrays of equal length)

    a, b = 0.0, 0.0        # Initial parameters
    learning_rate = 1e-11  # Hyperparameter
    num_epochs = 5         # Hyperparameter

    for epoch in range(num_epochs):
        y_pred = a * X + b  # Predictions for all data points at once
        a = a - learning_rate * np.sum((y_pred - Y) * X)
        b = b - learning_rate * np.sum(y_pred - Y)
        rmse = np.sqrt(np.mean(np.square(y_pred - Y)))
        print(f"Epoch {epoch+1}: a={a}, b={b}, RMSE={rmse}")
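The gradient of a function is a vector of its partial derivatives:
$$\nabla_\theta E = \left[ \frac{\partial E}{\partial \theta_1}, \dots, \frac{\partial E}{\partial \theta_n} \right]^T$$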
Analytic Solution
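Hypothetically, we could skip gradient descent and find the minimum analytically. Writing the inputs as a matrix $X$ (one row per data point, with a constant $1$ appended for the bias), the outputs as a vector $y$ and the parameters as a vector $\theta$, we set the gradient of the loss to zero:
$$\nabla_\theta E = X^T (X\theta - y) = 0$$
Then, the optimal parameters can be found using the equation:
$$\theta = (X^T X)^{-1} X^T y$$
However, this is impractical for large problems, as inverting an $n \times n$ matrix requires $O(n^3)$ time.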
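As a minimal sketch of the closed-form solution $\theta = (X^T X)^{-1} X^T y$, assuming NumPy and a small synthetic dataset (the data values and the helper name `fit_normal_equation` are illustrative and not part of the original notes), the parameters can be obtained by solving the linear system $X^T X \theta = X^T y$ instead of forming the inverse explicitly:

```python
import numpy as np

def fit_normal_equation(x, y):
    """Least-squares fit of y ~ a * x + b via the normal equation."""
    # Design matrix with a column of ones so that the bias b is learned too.
    X = np.column_stack([x, np.ones_like(x)])
    # Solve (X^T X) theta = X^T y; this avoids computing the inverse directly.
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return theta  # [a, b]

# Illustrative data generated from y = 2x + 1 (an assumption for this sketch).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
a, b = fit_normal_equation(x, y)
```

For two parameters this is instant; the cubic cost only becomes a concern when the number of features is large.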
2.2 Multiple Linear Regression
In practice, we use linear regression with multiple input features. Then, $\hat{y} = \theta_0 + \theta_1 x_1 + \dots + \theta_K x_K$, where $K$ is the number of features. The loss function and gradient descent are the same as before, but with more parameters.
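A possible vectorised version for several features is sketched below. It assumes NumPy, synthetic noiseless data, and a parameter vector `theta` whose first entry plays the role of the bias; it also averages the gradient over the data points (rather than summing) so that the learning rate does not depend on the dataset size. All of these are choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative noiseless data: y = 1 + 2*x1 - 3*x2 (assumed for this sketch).
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so theta[0] acts as the bias.
X1 = np.column_stack([np.ones(len(X)), X])
theta = np.zeros(X1.shape[1])
learning_rate = 0.5

for _ in range(500):
    y_pred = X1 @ theta
    # Gradient of the mean of 0.5 * squared errors with respect to theta.
    grad = X1.T @ (y_pred - y) / len(y)
    theta -= learning_rate * grad
```

After training, `theta` should be close to the generating coefficients `[1, 2, -3]`.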
3. Neurons
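A neuron has inputs $x_1, \dots, x_K$ with weights $w_1, \dots, w_K$ and bias $b$, producing output $\hat{y}$. It also has an activation function $g$ that introduces non-linearity:
$$\hat{y} = g\left( \sum_{i=1}^{K} w_i x_i + b \right)$$
In notation, we often omit the bias by adding an extra input $x_0 = 1$ with weight $w_0 = b$. We can rewrite this with vector notation using $x = [x_0, x_1, \dots, x_K]^T$ and $w = [w_0, w_1, \dots, w_K]^T$: $\hat{y} = g(w^T x)$.
A simple activation function is the logistic or sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$, which maps any $z \in \mathbb{R}$ to $(0, 1)$. Logistic regression is a classification model using this sigmoid activation function.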
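The sketch below computes a single sigmoid neuron in plain Python, once with an explicit bias and once with the bias folded in as the weight of a constant input of 1. The function names and the input values are hypothetical, chosen only so that the weighted sum is exactly zero.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through a sigmoid."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

def neuron_folded_bias(x, w_with_bias):
    """Same neuron, with the bias stored as the weight of a constant input 1."""
    x_ext = [1.0] + list(x)
    return sigmoid(sum(w_i * x_i for w_i, x_i in zip(w_with_bias, x_ext)))

x = [0.5, -1.0]
w = [2.0, 1.0]
b = 0.0
out = neuron(x, w, b)                        # z = 2*0.5 + 1*(-1) + 0 = 0
out_folded = neuron_folded_bias(x, [b] + w)  # identical result
```

Both calls return 0.5, the value of the sigmoid at zero.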
3.1 Perceptron
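The perceptron is a learning algorithm for a single neuron that does not use gradient descent. It uses a threshold function as the activation function: $g(z) = 1$ if $z > 0$, and $0$ otherwise. The learning rule is $w_i := w_i + \alpha (y - \hat{y}) x_i$:
If the desired output $y$ matches the predicted output $\hat{y}$, no update is made.
If $y = 1$ and $\hat{y} = 0$, the weights are increased to make $\hat{y}$ more likely to be 1.
If $y = 0$ and $\hat{y} = 1$, the weights are decreased to make $\hat{y}$ more likely to be 0.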
Using a perceptron, we can learn any linearly separable function. The activation function is sharp and non-differentiable, so we cannot use it with gradient descent.
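To make the learning rule concrete, here is a small sketch in plain Python that trains a perceptron on the logical AND function, which is linearly separable. The update used is $w_i := w_i + \alpha (y - \hat{y}) x_i$ with the bias folded in as `w[0]`; the function names, the learning rate of 1.0 and the epoch cap are assumptions made for this example.

```python
def predict(w, x):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if z > 0 else 0

def train_perceptron(data, learning_rate=1.0, max_epochs=100):
    """Apply w_i += lr * (y - y_hat) * x_i until an epoch has no mistakes."""
    n_inputs = len(data[0][0])
    w = [0.0] * (n_inputs + 1)  # w[0] is the bias weight
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            x_ext = [1.0] + list(x)  # constant input 1 for the bias
            y_hat = predict(w, x_ext)
            if y_hat != y:
                mistakes += 1
                for i in range(len(w)):
                    w[i] += learning_rate * (y - y_hat) * x_ext[i]
        if mistakes == 0:
            break
    return w

# Logical AND: only the input (1, 1) has desired output 1.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_data)
predictions = [predict(w, [1.0] + list(x)) for x, _ in and_data]
```

No derivative of the activation function is needed anywhere, which is why the non-differentiable threshold is not a problem for this rule.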
4. Multi-layer Networks
Neurons can be connected in parallel, so that each neuron detects something different in the input data. By connecting them in sequence we can learn higher order features. With an arbitrary number of neurons, we can hypothetically model an arbitrary function. This is called a multi-layer network or multi-layer perceptron (MLP) (even though it's not a perceptron).
In practice we don't want to draw out all the connections in a multi-layer network, as it quickly becomes very complex. Instead, we represent layers of neurons as boxes:
[Diagram: x -> h1 -> h2 -> y, where x is the input layer, h1 and h2 are hidden layers, and y is the output layer]
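Given this network, we can calculate each layer in turn, ending with the prediction $\hat{y}$ at the output node:
$$h_1 = g(W_1 x + b_1), \qquad h_2 = g(W_2 h_1 + b_2), \qquad \hat{y} = g(W_3 h_2 + b_3)$$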
Neural networks are able to extract complex features from data by stacking multiple layers of neurons. Each layer learns to transform its input into a more abstract representation, enabling the network to capture intricate patterns.
5. Activation Functions
If the data is linearly separable, a linear activation function $g(z) = z$ is sufficient. However, with linear activations a multi-layer network reduces to a single-layer network, because a composition of linear functions is itself linear, which is not desirable. Alternatively, we can use non-linear activation functions:
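Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$ - maps to $(0, 1)$. A soft version of the threshold function.
Tanh: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ - maps to $(-1, 1)$. Also $\tanh(z) = 2\sigma(2z) - 1$.
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$ - maps to $[0, \infty)$. Computationally efficient and mitigates the vanishing gradient problem.
Softmax: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ - scales values into a probability distribution such that all values sum to $1$.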
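The standard definitions of these functions can be sketched in plain Python as follows. The check at the end illustrates the identity $\tanh(z) = 2\sigma(2z) - 1$; subtracting the maximum inside `softmax` is a common numerical-stability trick added here, not something the notes prescribe.

```python
import math

def sigmoid(z):
    """Maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Maps any real z into [0, inf)."""
    return max(0.0, z)

def softmax(zs):
    """Turns a list of scores into a probability distribution."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = 0.7
tanh_via_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0  # should match math.tanh(z)
probs = softmax([1.0, 2.0, 3.0])                 # non-negative, sums to 1
```

Note that `softmax` takes the whole vector of scores, while the other functions act on one value at a time.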
Most activation functions are applied element-wise (except softmax, which needs the whole vector). ReLU is commonly used in deep networks, but sigmoid and tanh are more robust. The activation function of the output layer depends on the task:
Binary classification - use sigmoid or tanh.
Unbounded Score - linear.
Probability Distributions - softmax.
We can easily implement a neural network with torch:
    import torch
    import torch.nn as nn


    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer_h = nn.Linear(10, 5)  # 10-dimensional input, 5 hidden units
            self.layer_y = nn.Linear(5, 1)   # 5 hidden units, 1-dimensional output

        def forward(self, x):
            h = torch.tanh(self.layer_h(x))     # Hidden layer with tanh activation
            y = torch.sigmoid(self.layer_y(h))  # Output layer with sigmoid activation
            return y


    net = Net()
    data = torch.tensor([[0.5] * 10])  # Example input: one data point, 10 features
    output = net(data)
6. Loss Functions
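A loss function is the quantity we minimise when optimising a neural network. We update the parameters using gradient descent, $\theta := \theta - \alpha \nabla_\theta E$, where $E$ is the loss function. For regression, we typically use Mean Squared Error, $E = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$, and $\frac{\partial E}{\partial \hat{y}_i} = \frac{2}{N} (\hat{y}_i - y_i)$.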
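For classification, we either have binary (two classes), multi-class (many classes) or multi-label (each data point can take multiple classes) problems. These can all be optimised using different versions of cross entropy loss, which maximise the likelihood of the correct class. Assuming that the data points are independent and identically distributed (i.i.d.), the likelihood is:
$$\prod_{i=1}^{N} P(y_i \mid x_i; \theta)$$
For binary classification, we can use the Bernoulli distribution, so $P(y_i \mid x_i; \theta) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$. However, the values here get very small and difficult to work with, so we take the negative log likelihood (averaged over the $N$ data points), giving our cross entropy loss functions:
$$E_{\text{binary}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
$$E_{\text{categorical}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c \in C} y_{i,c} \log \hat{y}_{i,c}$$
Here $C$ is the set of possible classes for categorical (multi-class) classification. For multi-label classification, we can use multiple binary cross entropy losses, one for each label.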
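A plain Python sketch of the two losses is given below, written as the negative log likelihood averaged over the data points. The clipping constant `eps` is an addition for numerical safety, and the function names and example values are hypothetical.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy; y_pred holds predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross entropy; rows are one-hot targets and predicted distributions."""
    total = 0.0
    for target_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(target_row, pred_row):
            total += y * math.log(max(p, eps))
    return -total / len(y_true)

# A prediction of 0.5 costs log(2) regardless of the true label.
bce = binary_cross_entropy([1, 0], [0.5, 0.5])
# Only the probability assigned to the correct class (here 0.5) contributes.
cce = categorical_cross_entropy([[0, 1, 0]], [[0.2, 0.5, 0.3]])
```

For multi-label targets, `binary_cross_entropy` could be applied once per label and the results summed or averaged.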
7. Backpropagation
We can use batching to combine our data point vectors into a matrix $X \in \mathbb{R}^{N \times K}$, where $N$ is the number of data points and $K$ is the number of features. This allows us to compute the outputs of a layer of neurons in one matrix operation: $Z = XW + b$, where $W$ is the weight matrix, $b$ is the bias vector, and $Z$ is the output matrix. This is much more parallelisable on GPUs.
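Backpropagation makes gradient descent efficient for multi-layer networks by avoiding recalculating the partial derivatives that layers share. A forward pass computes the outputs of each layer, and a backward pass computes the gradients of each layer using the chain rule. For example, in a network $x \to h_1 \to h_2 \to \hat{y}$ with weight matrices $W_1$, $W_2$, $W_3$:
$$\frac{\partial E}{\partial W_3} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W_3}, \qquad \frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h_2} \frac{\partial h_2}{\partial W_2}, \qquad \frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h_2} \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial W_1}$$
This way, we have less computation, as we only need to compute $\frac{\partial E}{\partial \hat{y}}$ once, and reuse it for each layer.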
7.1 Backpropagation Example
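Assume we have a linear layer $Z = XW + b$.
We already receive $\frac{\partial E}{\partial Z}$ from the next layer.
To update the weights we must calculate $\frac{\partial E}{\partial W}$, as well as $\frac{\partial E}{\partial b}$. In this case, we can rewrite $\frac{\partial E}{\partial W}$ as $X^T \frac{\partial E}{\partial Z}$, where $X$ is the input matrix. Also, $\frac{\partial Z}{\partial b} = 1$, so $\frac{\partial E}{\partial b} = \mathbf{1}^T \frac{\partial E}{\partial Z}$, which sums the gradient over the data points in the batch.
To pass the gradient to a lower layer we need $\frac{\partial E}{\partial X}$. We know that $\frac{\partial Z}{\partial X} = W^T$, so $\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Z} W^T$.
To pass the activation function through backpropagation, we need to compute $\frac{\partial E}{\partial Z}$, where $A = g(Z)$. We can rewrite $\frac{\partial E}{\partial Z} = \frac{\partial E}{\partial A} \circ g'(Z)$, where $\circ$ is the element-wise product and $g'$ is the derivative of the activation function.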
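| Activation Function | Formula | Derivative |
|---|---|---|
| Linear | $g(z) = z$ | $g'(z) = 1$ |
| Sigmoid | $g(z) = \frac{1}{1 + e^{-z}}$ | $g'(z) = g(z)(1 - g(z))$ |
| Tanh | $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | $g'(z) = 1 - g(z)^2$ |
| ReLU | $g(z) = \max(0, z)$ | $g'(z) = 1$ if $z > 0$, otherwise $0$ |
| Softmax | $g(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | $\frac{\partial E}{\partial z} = \hat{y} - y$ for cross entropy loss |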
Since softmax outputs a vector for each data point, $\hat{y}$ and $y$ are matrices when working with a batch.
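The backward pass through one linear layer followed by a sigmoid can be sketched with NumPy as below. The sum-of-squares loss, the layer sizes, and the names `forward` and `backward` are assumptions made for this illustration; the last lines compare one backpropagated weight gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, b, Y):
    Z = X @ W + b                      # linear layer, shape (N, M)
    A = sigmoid(Z)                     # element-wise activation
    loss = 0.5 * np.sum((A - Y) ** 2)  # sum-of-squares loss
    return Z, A, loss

def backward(X, W, A, Y):
    dA = A - Y                         # dE/dA for the loss above
    dZ = dA * A * (1.0 - A)            # dE/dZ = dE/dA * g'(Z), element-wise
    dW = X.T @ dZ                      # dE/dW = X^T dE/dZ
    db = dZ.sum(axis=0)                # dE/db = 1^T dE/dZ
    dX = dZ @ W.T                      # dE/dX = dE/dZ W^T, passed to the layer below
    return dW, db, dX

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # N = 4 data points, K = 3 features
W = rng.normal(size=(3, 2))
b = np.zeros(2)
Y = rng.uniform(size=(4, 2))

_, A, _ = forward(X, W, b, Y)
dW, db, dX = backward(X, W, A, Y)

# Finite-difference estimate of dE/dW[0, 0] for comparison.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (forward(X, W_plus, b, Y)[2] - forward(X, W_minus, b, Y)[2]) / (2 * eps)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick sanity check when implementing layers by hand.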
8. Gradient Descent
Gradient descent is used to iteratively train a model. With learning rate $\alpha$, we update the weights $W := W - \alpha \frac{\partial E}{\partial W}$. This means all network functions and the loss must be differentiable. To do it, initialize the weights randomly; then, until convergence, compute the gradient based on the whole dataset, and update the weights.
However, the whole dataset usually is too large, so we do mini-batched gradient descent.
Initialize weights randomly.
Until convergence: loop over batches of datapoints, compute gradient based on the batch only, and update weights.
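A minimal mini-batched version for the earlier one-feature linear model might look like the sketch below. It uses a fixed number of epochs in place of a convergence test, shuffles the data each epoch, and averages the gradient within each batch; the data, the hyperparameter values and the function name are assumptions made for the example.

```python
import random

def minibatch_gradient_descent(xs, ys, learning_rate=0.5, batch_size=10,
                               num_epochs=200, seed=0):
    """Fit y ~ a * x + b with mini-batched gradient descent."""
    rng = random.Random(seed)
    a, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)  # random initialisation
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the data in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean of 0.5 * squared errors, on this batch only.
            grad_a = sum((a * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((a * xs[i] + b - ys[i]) for i in batch) / len(batch)
            a -= learning_rate * grad_a
            b -= learning_rate * grad_b
    return a, b

# Illustrative noiseless data from y = 2x + 1 (an assumption for this sketch).
xs = [i / 99 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]
a, b = minibatch_gradient_descent(xs, ys)
```

Each update only touches `batch_size` data points, so the cost per step stays constant as the dataset grows.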
8.1 Learning Rate
In general, loss surfaces are complex and we want to avoid getting stuck in poor local minima. If the learning rate is too low, training converges very slowly or stalls; if it is too high, we may overshoot minima.
Adaptive learning rates may have a different learning rate for each parameter, taking bigger steps if the gradient is small, and vice versa.
Learning rate decay takes smaller steps as we get closer to the minimum, for example $\alpha := d\,\alpha$ after each epoch, with a decay factor $d \in (0, 1)$.
8.2 Weight Initialization
Zero: this simple method avoids neurons starting out with a bias. However, if all the weights start off the same, all neurons will learn the same features.
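Normal: draw weights from a normal distribution with mean $0$ and a small standard deviation $\sigma$: $W \sim \mathcal{N}(0, \sigma^2)$.
Xavier Glorot: $W \sim U\left( -\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}} \right)$, where $n_{in}$ is the number of inputs and $n_{out}$ is the number of outputs. This keeps the variance of activations and backpropagated gradients roughly the same across layers.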
Although randomness does lead to different results, we can run with different random seeds and report the average.
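The two random schemes can be sketched in plain Python as follows, using the uniform form of Glorot initialisation with limit $\sqrt{6 / (n_{in} + n_{out})}$. The function names, the seed handling, and the default standard deviation of 0.1 are assumptions for this example.

```python
import math
import random

def normal_init(n_in, n_out, sigma=0.1, seed=0):
    """n_in x n_out weights drawn from a normal distribution with mean 0."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(n_out)] for _ in range(n_in)]

def glorot_uniform(n_in, n_out, seed=0):
    """n_in x n_out weights drawn from U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

W_normal = normal_init(10, 5)
W_glorot = glorot_uniform(10, 5)
limit = math.sqrt(6.0 / (10 + 5))
```

Passing a different `seed` gives a different draw, which is how repeated runs with different random seeds could be set up.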
8.3 Data Normalization
Data normalization helps with convergence, as weight updates are proportional to the input data. Common methods include:
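Min Max Normalization: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ scales data to $[0, 1]$.
Standardization ($z$-normalization): $x' = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation, giving data with mean $0$ and variance $1$.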
Scaling values must only be calculated on the training set.
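The following plain Python sketch shows both methods for a single feature, with the scaling values fitted on the training set and then reused unchanged on the test set. The function names and the data are hypothetical.

```python
import statistics

def fit_min_max(train):
    """Minimum and maximum of the training data."""
    return min(train), max(train)

def apply_min_max(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]

def fit_standardizer(train):
    """Mean and (population) standard deviation of the training data."""
    return statistics.mean(train), statistics.pstdev(train)

def apply_standardizer(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

lo, hi = fit_min_max(train)            # statistics come from the training set only
train_mm = apply_min_max(train, lo, hi)
test_mm = apply_min_max(test, lo, hi)  # test values may fall outside [0, 1]

mu, sigma = fit_standardizer(train)
train_std = apply_standardizer(train, mu, sigma)
test_std = apply_standardizer(test, mu, sigma)
```

Here the test value 10.0 maps above 1 under min-max scaling, which is expected: the test set is scaled with the training statistics rather than its own.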
8.4 Gradient Checking
Gradient checking verifies that backpropagation is correctly computing gradients. There are two methods:
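Check weight difference before and after gradient descent: $\frac{\partial E}{\partial w} = \frac{w^{(t)} - w^{(t+1)}}{\alpha}$, where $w^{(t)}$ and $w^{(t+1)}$ are the weight before and after one update.
Change weight slightly and check loss difference: $\frac{\partial E}{\partial w} \approx \frac{E(w + \epsilon) - E(w - \epsilon)}{2\epsilon}$.
Both methods should give very similar values of $\frac{\partial E}{\partial w}$.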
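Both checks can be tried on a one-weight model, as in the sketch below. The loss $E = \frac{1}{2}(wx - y)^2$, the numbers, and the function names are assumptions chosen so that the gradient is easy to verify by hand: $(0.3 \cdot 2 - 1) \cdot 2 = -0.8$.

```python
def loss(w, x, y):
    """Squared error of a one-weight linear model: E = 0.5 * (w*x - y)^2."""
    return 0.5 * (w * x - y) ** 2

def analytic_gradient(w, x, y):
    """dE/dw as backpropagation would compute it."""
    return (w * x - y) * x

w, x, y = 0.3, 2.0, 1.0
learning_rate = 0.1

# Method 1: recover the gradient from the weight change of one update step.
w_new = w - learning_rate * analytic_gradient(w, x, y)
grad_from_update = (w - w_new) / learning_rate

# Method 2: central finite difference on the loss.
eps = 1e-5
grad_numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
```

If the two values disagree noticeably, the analytic gradient (or the update step) most likely contains a bug.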
9. Overfitting
There is a strong correlation between capacity of a neural network and its ability to overfit the training data. To prevent overfitting, we can:
Decrease the capacity (reduce number of layers/neurons).
Use more training data.
Stop early: use a validation set to monitor performance, and stop training when performance has not improved for a number of epochs.
Regularisation: add information / constraints to the model to prevent overfitting (e.g. limiting weight magnitude).
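L2 Regularisation: add squared weights to the loss function, encouraging sharing between features. $E' = E + \lambda \sum_{w} w^2$. So $w := w - \alpha \left( \frac{\partial E}{\partial w} + 2\lambda w \right)$.
L1 Regularisation: add absolute weights to the loss function, encouraging sparsity. $E' = E + \lambda \sum_{w} |w|$. So, $w := w - \alpha \left( \frac{\partial E}{\partial w} + \lambda\, \mathrm{sign}(w) \right)$.
Dropout: randomly disable neurons during training, preventing co-adaptation. Each neuron is kept with probability $p$ during training, and its output is scaled by $p$ during testing (typically $p = 0.5$).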
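The last three items can be sketched in plain Python as below. The penalties are assumed to be $\lambda \sum w^2$ for L2 and $\lambda \sum |w|$ for L1, and dropout follows the description above (keep with probability `keep_prob` in training, scale by `keep_prob` at test time). All function names and numbers are illustrative.

```python
import random

def l2_step(w, grad, learning_rate, lam):
    """Gradient step for E + lam * sum(w^2): the penalty adds 2 * lam * w."""
    return [w_i - learning_rate * (g_i + 2.0 * lam * w_i) for w_i, g_i in zip(w, grad)]

def sign(v):
    return (v > 0) - (v < 0)

def l1_step(w, grad, learning_rate, lam):
    """Gradient step for E + lam * sum(|w|): the penalty adds lam * sign(w)."""
    return [w_i - learning_rate * (g_i + lam * sign(w_i)) for w_i, g_i in zip(w, grad)]

def dropout_train(activations, keep_prob, rng):
    """Training: keep each activation with probability keep_prob, zero the rest."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    """Testing: keep every neuron, scaled by keep_prob to match the training average."""
    return [a * keep_prob for a in activations]

w = [1.0, -2.0]
zero_grad = [0.0, 0.0]  # isolate the effect of the penalty term
w_l2 = l2_step(w, zero_grad, learning_rate=0.1, lam=0.5)  # each weight shrinks by 10%
w_l1 = l1_step(w, zero_grad, learning_rate=0.1, lam=0.5)  # each weight moves 0.05 towards 0

h = [1.0, 2.0, 3.0, 4.0]
h_train = dropout_train(h, keep_prob=0.5, rng=random.Random(0))
h_test = dropout_test(h, keep_prob=0.5)
```

The L2 step shrinks each weight in proportion to its size, while the L1 step moves every non-zero weight by the same amount, which is what pushes small weights to exactly zero over time.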