Neural Networks

1. Neural Networks

Artificial Neural Networks are a class of machine learning algorithms. Their architecture consists of connected neurons whose parameters are optimised by gradient descent. Deep learning refers to neural network models with multiple hidden layers.

2. Linear Regression

Linear regression is a type of supervised learning, where a dataset consisting of inputs $x_i$ and outputs $y_i$ is used to learn a function $f$ such that $f(x_i) \approx y_i$.

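A linear regression model assumes that the output is a linear function of the inputs: $\hat{y} = ax + b$. The model is trained by minimising the loss function between the predicted outputs $\hat{y}_i$ and the true outputs $y_i$. Our sum of squares loss function is defined as:

$$E = \frac{1}{2} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

A good loss function is easily differentiable. To minimise the loss function, we use gradient descent. To do this, we update the parameters using their partial derivatives:

$$\frac{\partial E}{\partial a} = \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial E}{\partial b} = \sum_{i=1}^{N} (\hat{y}_i - y_i)$$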

2.1 Gradient Descent

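Gradient descent repeatedly updates the parameters $a$, $b$ by taking small steps in the negative direction of the partial derivatives of the loss $E$: $a \leftarrow a - \alpha \frac{\partial E}{\partial a}$, $b \leftarrow b - \alpha \frac{\partial E}{\partial b}$, where $\alpha$ is the learning rate (or step size). For example: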

```python
import numpy as np

# Assume X and Y are given data points (sequences of equal length)

a, b = 0.0, 0.0  # Initial parameters
learning_rate = 1e-11  # Hyperparameter
num_epochs = 5  # Hyperparameter

for epoch in range(num_epochs):
    update_a = 0.0  # Accumulates the partial derivative with respect to a
    update_b = 0.0  # Accumulates the partial derivative with respect to b
    error = 0.0  # Accumulates the squared error
    for i in range(len(Y)):
        y_pred = a * X[i] + b
        update_a += (y_pred - Y[i]) * X[i]
        update_b += (y_pred - Y[i])
        error += np.square(y_pred - Y[i])
    # Update the parameters once per epoch, after seeing all data points
    a -= learning_rate * update_a
    b -= learning_rate * update_b
    rmse = np.sqrt(error / len(Y))
    print(f"Epoch {epoch+1}: a={a}, b={b}, RMSE={rmse}")
```

We can do this much faster with vector operations:

```python
import numpy as np

# Assume X and Y are given as NumPy arrays of equal length

a, b = 0.0, 0.0  # Initial parameters
learning_rate = 1e-11  # Hyperparameter
num_epochs = 5  # Hyperparameter

for epoch in range(num_epochs):
    y_pred = a * X + b  # Predictions for the whole dataset at once
    a = a - learning_rate * np.sum((y_pred - Y) * X)
    b = b - learning_rate * np.sum(y_pred - Y)
    rmse = np.sqrt(np.mean(np.square(y_pred - Y)))
    print(f"Epoch {epoch+1}: a={a}, b={b}, RMSE={rmse}")
```

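The gradient of a function is a vector of its partial derivatives. For a function $f$ of parameters $\theta_1, \dots, \theta_n$:

$$\nabla_\theta f = \left[ \frac{\partial f}{\partial \theta_1}, \dots, \frac{\partial f}{\partial \theta_n} \right]$$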

Analytic Solution

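Hypothetically, instead of running gradient descent, we could find an analytic solution by setting the gradient of the sum of squares loss to zero:

$$X^\top (X\theta - y) = 0$$

Here $X$ is the matrix of inputs (with an extra column of ones for the bias), $y$ is the vector of true outputs, and $\theta$ is the parameter vector. Then, the optimal parameters can be found using the equation:

$$\theta = (X^\top X)^{-1} X^\top y$$

However, this is impractical for large problems, as inverting an $n \times n$ matrix requires $O(n^3)$ time.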
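As a minimal sketch of the analytic approach, the snippet below fits a line to a small synthetic dataset. The data and the names `X_design` and `theta` are illustrative assumptions, not part of the original notes. It uses `np.linalg.lstsq` rather than an explicit matrix inverse, which solves the same least-squares problem in a numerically safer way.

```python
import numpy as np

# Hypothetical toy data generated from y = 3x + 2 (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Design matrix: one column for the input, one column of ones for the bias
X_design = np.column_stack([x, np.ones_like(x)])

# Solve the least-squares problem min ||X_design @ theta - y||^2
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
a, b = theta
print(f"a={a:.3f}, b={b:.3f}")
```

For two parameters this is instantaneous; the cost only becomes a concern when the number of features is large.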

2.2 Multiple Linear Regression

In practice, we use linear regression with multiple input features. Then, $\hat{y} = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$, where $n$ is the number of features. The loss function and gradient descent are similar to before, but with more parameters.
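A rough sketch of gradient descent with several features is shown below. The synthetic data, `true_w`, `true_b` and the hyperparameters are assumptions chosen for illustration. Unlike the earlier examples, the gradient is averaged over the dataset rather than summed, so a larger learning rate can be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 200 data points, 3 features, generated without noise
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 0.7
Y = X @ true_w + true_b

w = np.zeros(3)  # One weight per feature
b = 0.0
learning_rate = 0.1

for epoch in range(500):
    residual = X @ w + b - Y  # Prediction error for every data point
    w -= learning_rate * (X.T @ residual) / len(Y)  # Mean gradient w.r.t. w
    b -= learning_rate * residual.mean()  # Mean gradient w.r.t. b

print(w, b)
```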

3. Neurons

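A neuron has inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and bias $b$, producing output $y$. It also has an activation function $g$ that introduces non-linearity:

$$y = g\left( \sum_{i=1}^{n} w_i x_i + b \right)$$

In notation, we often omit the bias by adding an extra input $x_0 = 1$ with weight $w_0 = b$. We can rewrite this with vector notation using $x = [x_0, \dots, x_n]$ and $w = [w_0, \dots, w_n]$: $y = g(w^\top x)$.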

A simple activation function is the logistic or sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$, which maps any $z \in \mathbb{R}$ to $(0, 1)$. Logistic regression is a classification model using this sigmoid activation function.
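A single neuron with a sigmoid activation can be sketched in a few lines of plain Python. The function names and the example weights are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5
output = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
print(output)
```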

3.1 Perceptron

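The perceptron is a learning algorithm that does not use gradient descent. Instead, it uses a threshold function as the activation function: $g(z) = 1$ if $z > 0$, and $0$ otherwise. The learning rule:

$$w_i \leftarrow w_i + \alpha (t - y) x_i$$

where $t$ is the target label, $y$ is the perceptron's output, and $\alpha$ is the learning rate.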

Using a perceptron, we can learn any linearly separable function. The activation function is sharp and non-differentiable, so we cannot use it with gradient descent.
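As an illustration, the sketch below trains a perceptron on the logical AND function, which is linearly separable. The function names, the learning rate of 1.0 and the stopping criterion (an epoch with no mistakes) are assumptions made for this example.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(data, learning_rate=1.0, max_epochs=100):
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in data:
            delta = target - predict(weights, bias, inputs)
            if delta != 0:
                mistakes += 1
                # Move the weights towards (or away from) the misclassified input
                weights = [w + learning_rate * delta * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * delta
        if mistakes == 0:  # Every data point is classified correctly
            break
    return weights, bias


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_data)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_data]
print(weights, bias, predictions)
```

The same code would never reach zero mistakes on XOR, since that function is not linearly separable.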

4. Multi-layer Networks

Neurons can be connected in parallel, so that each neuron detects something different in the input data. By connecting them in sequence we can learn higher order features. With an arbitrary number of neurons, we can hypothetically model an arbitrary function. This is called a multi-layer network or multi-layer perceptron (MLP) (even though it's not a perceptron).

[Figure: a multi-layer network. Inputs x1, x2, x3 are each connected to every neuron in a first layer of three neurons, which are each connected to every neuron in a second layer of three neurons, which feed the outputs y1, y2, y3.]

In practice we don't want to draw out all the connections in a multi-layer network, as it quickly becomes very complex. Instead, we represent layers of neurons as boxes:

[Figure: the same kind of network drawn as boxes: x -> h1 -> h2 -> y.]

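Given this network, we can calculate:

$$h_1 = g(W_1 x + b_1), \qquad h_2 = g(W_2 h_1 + b_2), \qquad y = g(W_3 h_2 + b_3)$$

where $W_k$ and $b_k$ are the weights and biases of layer $k$, and $g$ is the activation function.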

Neural networks are able to extract complex features from data by stacking multiple layers of neurons. Each layer learns to transform its input into a more abstract representation, enabling the network to capture intricate patterns.
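The forward computation through a network with two hidden layers can be sketched with NumPy as follows. The layer sizes, the random weights and the choice of sigmoid for every layer are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical sizes: 3 inputs, two hidden layers of 4 units, 2 outputs
sizes = [3, 4, 4, 2]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]


def forward(x):
    """Apply each layer in sequence: activation = g(W @ activation + b)."""
    activation = x
    for W, b in zip(weights, biases):
        activation = sigmoid(W @ activation + b)
    return activation


x = np.array([0.5, -1.0, 2.0])
y = forward(x)
print(y)
```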

5. Activation Functions

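If the data is linearly separable, a linear activation function is sufficient: $g(z) = z$. However, this reduces a multi-layer network to a single-layer network, since a composition of linear functions is itself linear, which is not desirable. Alternatively, we can use non-linear activation functions:

  - Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
  - Tanh: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
  - ReLU: $\text{ReLU}(z) = \max(0, z)$
  - Softmax: $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$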

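Most activation functions are applied element-wise (except softmax, which normalises over the whole vector). ReLU is commonly used in deep networks, but sigmoid and tanh are more robust. The activation function of the output layer depends on the task:

  - Regression: linear
  - Binary classification: sigmoid
  - Multi-class classification: softmax
  - Multi-label classification: sigmoid (one per label)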

We can easily implement a neural network with torch:

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_h = nn.Linear(10, 5)  # 10 dimensional input, 5 hidden units
        self.layer_y = nn.Linear(5, 1)  # 5 hidden units, 1 dimensional output

    def forward(self, x):
        h = torch.tanh(self.layer_h(x))  # Hidden layer with tanh activation
        y = torch.sigmoid(self.layer_y(h))  # Output layer with sigmoid activation
        return y


net = Net()
data = torch.tensor([[0.5] * 10])  # Example input
output = net(data)
```
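Two claims from this section can be checked numerically with a short NumPy sketch: stacking layers with linear activations is equivalent to a single linear layer, and softmax (unlike element-wise activations) normalises each row so that it sums to one. The shapes and random values are assumptions for illustration, and biases are left out for brevity.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)  # Element-wise


def softmax(z):
    # Subtract the row maximum for numerical stability before exponentiating
    exp = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp / np.sum(exp, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # 5 data points, 3 features
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

# Two layers with linear activation collapse into one layer with weights W1 @ W2
two_linear_layers = (x @ W1) @ W2
one_linear_layer = x @ (W1 @ W2)
collapsed = np.allclose(two_linear_layers, one_linear_layer)

# With a non-linearity in between, the collapse no longer holds in general
with_relu = relu(x @ W1) @ W2

probs = softmax(x @ W1)
print(collapsed, probs.sum(axis=-1))
```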

6. Loss Functions

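A loss function is the quantity we minimise when optimising a neural network. We update the parameters using gradient descent, $\theta \leftarrow \theta - \alpha \frac{\partial L}{\partial \theta}$, where $L$ is the loss function and $\alpha$ is the learning rate. For regression, we use Mean Squared Error $L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$, and $\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{N} (\hat{y}_i - y_i)$.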

For classification, we either have binary (two classes), multi-class (many classes) or multi-label (each data point can take multiple classes). These can all be optimised using different versions of cross entropy loss, maximising the likelihood of the correct class:

$$\prod_{i=1}^{N} p(y_i \mid x_i; \theta)$$

This assumes that the data points are independent and identically distributed (i.i.d.). For binary classification, we can use the Bernoulli distribution, so $p(y_i \mid x_i; \theta) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$. However, the values here get very small and difficult to work with, so we take the negative log likelihood, giving our cross entropy loss functions:

$$L_{\text{binary}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

$$L_{\text{categorical}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c \in C} y_{i,c} \log \hat{y}_{i,c}$$

Here $C$ is the set of possible classes for categorical (multi-class) classification. For multi-label classification, we can use multiple binary cross entropy losses, one for each label.
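A minimal NumPy sketch of the two cross entropy losses follows. The function names, the clipping constant `eps` (used to avoid taking the log of zero) and the example values are assumptions for illustration.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log likelihood under a Bernoulli distribution."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))


def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Mean negative log probability assigned to the correct class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))


# A model that always predicts 0.5 has a binary cross entropy of log(2)
bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))

# One data point, three classes, correct class predicted with probability 0.5
cce = categorical_cross_entropy(np.array([[1.0, 0.0, 0.0]]),
                                np.array([[0.5, 0.25, 0.25]]))
print(bce, cce)
```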

7. Backpropagation

We can use batching to combine our data point vectors into a matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of data points and $d$ is the number of features. This allows us to compute the outputs of a layer of neurons in one matrix operation: $Z = XW + b$, where $W$ is the weight matrix, $b$ is the bias vector, and $Z$ is the output matrix. This is much more parallelisable on GPUs.
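To illustrate, the sketch below computes the output of one layer for a whole batch in a single matrix operation and checks that it matches a per-data-point loop. The shapes and random values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 data points, 3 features
W = rng.normal(size=(3, 5))  # 3 inputs feeding 5 neurons
b = rng.normal(size=5)  # One bias per neuron

# Batched: one matrix multiplication, with the bias broadcast across rows
Z_batched = X @ W + b

# Equivalent loop over individual data points
Z_looped = np.stack([x @ W + b for x in X])

print(Z_batched.shape, np.allclose(Z_batched, Z_looped))
```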

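Backpropagation makes gradient descent efficient for multi-layer networks by avoiding recalculating the partial derivatives of each layer. A forward pass computes the outputs of each layer, and a backward pass computes the gradients of each layer using the chain rule. For example, for a network $x \to h_1 \to h_2 \to y$ with loss $L$ and layer weights $W_1$, $W_2$, $W_3$:

$$\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial W_3}, \qquad \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial h_2} \frac{\partial h_2}{\partial W_2}, \qquad \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial h_2} \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial W_1}$$

This way, we have less computation, as we only need to compute $\frac{\partial L}{\partial y}$ once, and reuse it for each layer.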

7.1 Backpropagation Example

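Assume we have a linear layer $Z = XW + b$.

  1. We already receive $\frac{\partial L}{\partial Z}$ from the next layer.
  2. To update the weights we must calculate $\frac{\partial L}{\partial W}$, as well as $\frac{\partial L}{\partial b}$. In this case, we can rewrite $\frac{\partial L}{\partial W}$ as $X^\top \frac{\partial L}{\partial Z}$, where $X$ is the input matrix. Also, $\frac{\partial Z}{\partial b} = 1$, so $\frac{\partial L}{\partial b} = \mathbf{1}^\top \frac{\partial L}{\partial Z}$ (the sum of $\frac{\partial L}{\partial Z}$ over the data points in the batch).
  3. To pass the gradient to a lower layer we need $\frac{\partial L}{\partial X}$. We know that $Z = XW + b$, so $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W^\top$.
  4. To pass the activation function through backpropagation, we need to compute $\frac{\partial L}{\partial Z}$, where $A = g(Z)$. We can rewrite $\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial A} \circ g'(Z)$, where $\circ$ is the element-wise product and $g'$ is the derivative of the activation function.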
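
| Activation Function | Formula | Derivative |
| --- | --- | --- |
| Linear | $g(z) = z$ | $g'(z) = 1$ |
| Sigmoid | $g(z) = \frac{1}{1 + e^{-z}}$ | $g'(z) = g(z)(1 - g(z))$ |
| Tanh | $g(z) = \tanh(z)$ | $g'(z) = 1 - g(z)^2$ |
| ReLU | $g(z) = \max(0, z)$ | $g'(z) = 1$ if $z > 0$, else $0$ |
| Softmax for cross entropy loss | $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | $\frac{\partial L}{\partial z} = \hat{y} - y$ |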

Since softmax outputs a vector for each data point, the batched predictions $\hat{Y}$ and targets $Y$ are matrices.
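The backward pass for a linear layer followed by an activation can be sketched in NumPy as below. The sigmoid activation, the squared-error loss, and all shapes and names (`dL_dA`, `dL_dZ` and so on) are assumptions chosen to keep the example small.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # Batch of 4 data points, 3 features
W = rng.normal(size=(3, 2))  # Linear layer with 2 output units
b = np.zeros(2)
Y = rng.uniform(size=(4, 2))  # Hypothetical targets

# Forward pass
Z = X @ W + b
A = sigmoid(Z)
loss = 0.5 * np.sum((A - Y) ** 2)

# Backward pass
dL_dA = A - Y  # Gradient of the loss w.r.t. the activations
dL_dZ = dL_dA * A * (1.0 - A)  # Element-wise product with the sigmoid derivative
dL_dW = X.T @ dL_dZ  # Gradient used to update the weights
dL_db = dL_dZ.sum(axis=0)  # Sum over the batch dimension
dL_dX = dL_dZ @ W.T  # Gradient passed to the layer below

print(dL_dW.shape, dL_db.shape, dL_dX.shape)
```

Each gradient has the same shape as the quantity it refers to, which is a useful sanity check when implementing layers by hand.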

8. Gradient Descent

Gradient descent is used to iteratively train a model. With learning rate $\alpha$, we update weights $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$. This means all network functions and the loss must be differentiable. To do it, initialize weights randomly; then, until convergence, compute the gradient based on the whole dataset, and update the weights.

However, the whole dataset is usually too large to process at once, so we use mini-batch gradient descent.

  1. Initialize weights randomly.
  2. Until convergence: loop over batches of datapoints, compute gradient based on the batch only, and update weights.
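The steps above can be sketched for a linear model as follows. The synthetic data, batch size and learning rate are assumptions, and a fixed number of epochs stands in for a proper convergence test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free dataset: 256 data points, 2 features
X = rng.normal(size=(256, 2))
Y = X @ np.array([2.0, -1.0]) + 0.5

w = rng.normal(size=2) * 0.01  # Step 1: initialise weights randomly
b = 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(100):  # Step 2: fixed epoch count instead of a convergence check
    order = rng.permutation(len(X))  # Shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w + b - Y[idx]
        # Gradient computed from the current batch only
        w -= learning_rate * (X[idx].T @ residual) / len(idx)
        b -= learning_rate * residual.mean()

print(w, b)
```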

8.1 Learning Rate

In general, loss surfaces are complex and we want to avoid getting stuck in local minima. If the learning rate is too low, training converges very slowly (or not at all within the available time); if it is too high, we may overshoot minima.
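The effect of the learning rate can be seen on a toy problem: minimising $f(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The function name and the three learning rates are assumptions picked to show each regime.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2, starting from w."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w  # Gradient of w**2 is 2w
    return w


too_low = descend(0.01)  # Barely moves towards the minimum in 20 steps
reasonable = descend(0.4)  # Gets very close to the minimum at 0
too_high = descend(1.1)  # Overshoots further on every step and diverges

print(too_low, reasonable, too_high)
```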

8.2 Weight Initialization

Although randomness does lead to different results, we can run with different random seeds and report the average.

8.3 Data Normalization

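Data normalization helps with convergence, as weight updates are proportional to the input data. Common methods include:

  - Min-max scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
  - Standardization: $x' = \frac{x - \mu}{\sigma}$

Scaling values (such as $x_{\min}$, $x_{\max}$, $\mu$ and $\sigma$) must only be calculated on the training set, and then reused unchanged for the validation and test sets.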
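A short sketch of this rule, using standardisation (zero mean, unit variance) and min-max scaling as two common choices. The synthetic train and test sets are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

# Standardisation: statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std  # Reuse the training statistics

# Min-max scaling: again, the range comes from the training set only
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
X_train_mm = (X_train - x_min) / (x_max - x_min)
X_test_mm = (X_test - x_min) / (x_max - x_min)  # May fall slightly outside [0, 1]

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```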

8.4 Gradient Checking

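Gradient checking verifies that backpropagation is correctly computing gradients. There are two methods of obtaining the gradient to compare:

  1. Compute $\frac{\partial L}{\partial w}$ analytically using backpropagation.
  2. Estimate it numerically by changing the weight by a small amount $\epsilon$ and measuring the change in loss: $\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$.

Both methods should give very similar values of $\frac{\partial L}{\partial w}$.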
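A minimal sketch of gradient checking, comparing an analytically derived gradient against a central finite-difference estimate. The linear model, squared-error loss, step size `eps` and all function names are assumptions made for this example.

```python
import numpy as np


def loss_fn(w, X, Y):
    return 0.5 * np.sum((X @ w - Y) ** 2)


def analytic_gradient(w, X, Y):
    """Gradient of loss_fn derived by hand (what backpropagation would compute)."""
    return X.T @ (X @ w - Y)


def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
Y = rng.normal(size=10)
w = rng.normal(size=3)

grad_analytic = analytic_gradient(w, X, Y)
grad_numeric = numerical_gradient(lambda v: loss_fn(v, X, Y), w)
max_difference = np.max(np.abs(grad_analytic - grad_numeric))
print(max_difference)
```

A large difference usually points to a bug in the hand-written gradient rather than in the numerical estimate.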

9. Overfitting

There is a strong correlation between capacity of a neural network and its ability to overfit the training data. To prevent overfitting, we can:
