CNNs

CNNs introduce hierarchy and locality into neural networks.

1. Universal Approximator

Let $\varphi$ be a non-constant, bounded and monotonically increasing continuous function. For any $\varepsilon > 0$ and any continuous function $f$ defined on a compact subset of $\mathbb{R}^m$, there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$ and real vectors $w_i \in \mathbb{R}^m$ for $i = 1, \dots, N$ such that:

$$\left| \sum_{i=1}^{N} v_i \, \varphi(w_i^\top x + b_i) - f(x) \right| < \varepsilon \quad \text{for all } x$$

If $\varphi$ is a sensible activation function, then any continuous function can be approximated by a neural network with a single hidden layer and enough neurons. In practice, $N$ is very large and we suffer from the curse of dimensionality.
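As a minimal NumPy sketch of the theorem's form $\sum_i v_i \varphi(w_i x + b_i)$: a single hidden layer with random tanh features, where only the output weights are fit by least squares (an illustration, not a full training procedure; the function $\sin(x)$ and $N = 50$ are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hidden_layer(x, w, b, v):
    """Single-hidden-layer network: sum_i v_i * phi(w_i * x + b_i), phi = tanh."""
    return np.tanh(np.outer(x, w) + b) @ v

# Approximate f(x) = sin(x) on [0, pi] with N = 50 hidden units by solving
# a least-squares problem for the output weights v.
N = 50
x = np.linspace(0, np.pi, 200)
w = rng.normal(size=N)
b = rng.normal(size=N)
H = np.tanh(np.outer(x, w) + b)              # hidden activations
v, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)
err = np.max(np.abs(one_hidden_layer(x, w, b, v) - np.sin(x)))
```

Even this crude construction drives the approximation error down, but note how many units a one-dimensional toy already uses; in high dimensions the required $N$ explodes.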

1.1 Curse of Dimensionality

As the number of features or dimensions grows, the amount of data we need to generalise accurately grows exponentially.

To approximate a Lipschitz-continuous function to accuracy $\varepsilon$ requires $O(\varepsilon^{-d})$ training samples, where $d$ is the input dimension.

In dimension $d$, the $d$-dimensional volume of the interior will be $s^d$ times the volume of its original shape, where $s < 1$ is the scaling factor. Therefore, the volume of the rind relative to the original volume is $1 - s^d$. As a function of $d$, its rate of growth is $\frac{\partial}{\partial d}(1 - s^d) = -s^d \ln s$, where $\frac{\partial}{\partial d}$ is the differential operator. As $d$ grows, almost all of the volume concentrates in the rind.

In higher dimensions, most Euclidean distances between observations in a dataset are nearly the same and close ($\approx$) to the diameter of the region in which they are enclosed.
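This concentration of distances can be seen empirically. A small NumPy sketch (sample sizes and dimensions chosen arbitrarily) compares the relative spread of pairwise distances in 2 versus 1000 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n=200):
    """Ratio of std to mean of pairwise distances for n uniform points in [0,1]^d."""
    pts = rng.random((n, d))
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    dists = dists[np.triu_indices(n, k=1)]   # keep each unordered pair once
    return dists.std() / dists.mean()

# As d grows, the relative spread shrinks: distances concentrate.
spread_2d = distance_spread(2)
spread_1000d = distance_spread(1000)
```

In 1000 dimensions the standard deviation of the distances is a tiny fraction of their mean, so "nearest" and "farthest" neighbours become nearly indistinguishable.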

1.2 Invariance & Equivariance

Shift invariance describes a system's unchanging response to input shifts. For example, in image recognition, a shift-invariant system would recognize an object regardless of its position in the image. A function $f$ is shift invariant for the shift operator $S_v$ (shifting image $x$ by $v$) if $f(S_v x) = f(x)$.

Equivariance means that if the input changes in a certain way, the output changes in the same way. For example, in image processing, if an image is rotated, an equivariant system would produce an output that is also rotated in the same manner. A function $f$ is equivariant for the shift operator $S_v$ if $f(S_v x) = S_v f(x)$.
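The two definitions can be checked numerically on a 1-D signal (a minimal sketch; circular shifts and a circular correlation are used so the operators compose cleanly): global max pooling is shift invariant, while convolution is shift equivariant.

```python
import numpy as np

def shift(x, v):
    """Shift operator S_v: circularly shift a 1-D signal by v."""
    return np.roll(x, v)

def conv(x, w):
    """Circular cross-correlation of signal x with kernel w."""
    n, k = len(x), len(w)
    return np.array([sum(x[(i + a) % n] * w[a] for a in range(k)) for i in range(n)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
w = np.array([1.0, -1.0])

# Invariance: global max pooling gives the same output for the shifted input.
invariant = np.max(shift(x, 2)) == np.max(x)

# Equivariance: convolving a shifted input equals shifting the convolved output.
equivariant = np.allclose(conv(shift(x, 2), w), shift(conv(x, w), 2))
```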

1.3 Inductive Bias

We can introduce two principles:

  1. Translation Invariance: a shift in the input should lead to a shift in the hidden representation.
  2. Locality: we should not have to look far away from a location to glean relevant information about that area.

These can be applied with a sliding window approach, using the cross-correlation $(x \star w)(i, j) = \sum_{a,b} x(i + a, j + b)\, w(a, b)$, where $x$ is the image patch and $w$ is the kernel/filter.
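The sliding window can be sketched directly in NumPy (a naive loop for clarity; real implementations vectorise this):

```python
import numpy as np

def corr2d(x, w):
    """Valid 2-D cross-correlation: slide kernel w over image x,
    taking a dot product of each patch with the kernel."""
    H, W = x.shape
    f_h, f_w = w.shape
    out = np.zeros((H - f_h + 1, W - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + f_h, j:j + f_w] * w).sum()
    return out

x = np.arange(9.0).reshape(3, 3)
w = np.array([[1.0, 0.0], [0.0, 1.0]])
y = corr2d(x, w)   # 2x2 output: each entry is one patch-kernel dot product
```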

1.4 Deformation Invariance

We may also want invariance to deformation. Here we would have a warp operator $D_\tau$ which warps the image $x$ by a vector field $\tau$. Then, $f(D_\tau x) \approx f(x)$.

2. Convolutions

A convolution $(f * g)(t) = \int f(\tau)\, g(t - \tau)\, d\tau$ is commutative, associative, associative with scalar multiplication and distributive.
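These four properties can be verified numerically for discrete 1-D convolutions using NumPy's `np.convolve` (the example signals are arbitrary):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
h = np.array([2.0, -1.0])

# Commutative: f * g == g * f
commutative = np.allclose(np.convolve(f, g), np.convolve(g, f))

# Associative: (f * g) * h == f * (g * h)
associative = np.allclose(np.convolve(np.convolve(f, g), h),
                          np.convolve(f, np.convolve(g, h)))

# Associative with scalar multiplication: (a f) * g == a (f * g)
scalar = np.allclose(np.convolve(3.0 * f, g), 3.0 * np.convolve(f, g))

# Distributive: f * (g + h2) == f * g + f * h2
h2 = np.array([2.0, -1.0, 0.0])
distributive = np.allclose(np.convolve(f, g + h2),
                           np.convolve(f, g) + np.convolve(f, h2))
```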

3. CNNs

In conventional NNs, the input tensor is flattened into a vector, but this loses locality & translation invariance principles.

3.1 Convolutional Layer

In a CNN, we keep locality with a convolution, where the input tensor has width $W$, height $H$ and depth $D$ (channels). The kernel/filter has width $F$, height $F$ and depth $D$. The output tensor (activation map) has width $W - F + 1$ and height $H - F + 1$. In practice, convolutions are implemented as a series of dot products.

Each kernel produces its own activation map, and these are stacked to produce the output volume. If there are $K$ kernels, the output tensor has depth $K$.

A CNN is a sequence of convolutional layers interleaved with activation functions.

Each filter has $F \times F \times D + 1$ parameters, including the bias term.
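The shape and parameter bookkeeping above can be collected in a small helper (a sketch assuming stride 1 and no padding, as in the dimensions given so far):

```python
def conv_layer_stats(W, H, D, F, K):
    """Shapes and parameter count for a conv layer with K FxFxD kernels
    applied with stride 1 and no padding."""
    out_w, out_h = W - F + 1, H - F + 1
    params = K * (F * F * D + 1)   # +1 for the bias term of each filter
    return (out_w, out_h, K), params

# e.g. a 32x32 RGB image through ten 5x5 filters
shape, params = conv_layer_stats(W=32, H=32, D=3, F=5, K=10)
```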

A convolution reduces the depth of the network: each kernel spans all $D$ input channels and aggregates them into a single activation map.

3.1.1 Stride & Padding

In a convolution with stride $S$ and zero-padding $P$, the output width is $\lfloor (W - F + 2P) / S \rfloor + 1$ (and similarly for the height).
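The standard output-size formula $\lfloor (W - F + 2P)/S \rfloor + 1$ as a one-line helper, with two common configurations:

```python
def conv_output_size(W, F, P, S):
    """Output width for input width W, filter size F, padding P, stride S:
    floor((W - F + 2P) / S) + 1."""
    return (W - F + 2 * P) // S + 1

same = conv_output_size(W=32, F=3, P=1, S=1)    # "same" padding preserves size
halved = conv_output_size(W=32, F=3, P=1, S=2)  # stride 2 roughly halves it
```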

3.1.2 Computational Complexity

As each convolution decreases the $W$ and $H$ dimensions, the computational cost of each layer decreases, saving operations.

This can be exploited by using multiple small filters to approximate a single large filter: for example, two stacked $3 \times 3$ convolutions cover the same receptive field as one $5 \times 5$ convolution, with fewer parameters.
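The trade-off can be quantified (a sketch, ignoring biases and assuming each layer keeps $C$ channels; $C = 64$ is an arbitrary choice):

```python
def receptive_field(filters):
    """Receptive field of a stack of stride-1 convolutions with given filter sizes."""
    rf = 1
    for f in filters:
        rf += f - 1
    return rf

def param_count(filters, C):
    """Parameters (ignoring biases) for a stack of conv layers, each with
    C input and C output channels."""
    return sum(f * f * C * C for f in filters)

C = 64
rf_small = receptive_field([3, 3])   # two 3x3 layers
rf_large = receptive_field([5])      # one 5x5 layer
p_small = param_count([3, 3], C)     # 2 * 9 * C^2
p_large = param_count([5], C)        # 25 * C^2
```

Both stacks see a $5 \times 5$ region, but the two $3 \times 3$ layers use $18C^2$ parameters instead of $25C^2$, and add an extra non-linearity between them.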

3.2 Pooling Layer

Pooling aggregates and downsamples the input tensor. The order of pixels does not matter, only their values. This introduces translational invariance. Some pooling methods are:

  - Max pooling: take the maximum value in each window.
  - Average pooling: take the mean value in each window.

This is applied to each channel separately.
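Max pooling on a single channel can be sketched with a reshape trick (non-overlapping windows, ragged edges dropped):

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling on a single channel."""
    H, W = x.shape
    x = x[:H - H % k, :W - W % k]                  # drop ragged edges
    return x.reshape(H // k, k, W // k, k).max(axis=(1, 3))

x = np.array([[1.0, 2.0, 5.0, 3.0],
              [4.0, 0.0, 1.0, 2.0],
              [7.0, 1.0, 0.0, 6.0],
              [2.0, 3.0, 4.0, 1.0]])
y = max_pool2d(x)   # 2x2 output holding the maximum of each 2x2 block
```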

3.3 Fully Connected Layer

A standard dense layer, where each neuron is connected to every neuron in the previous layer.

3.4 Flatten Layer

This connects the convolutional layers to the fully connected layers by flattening the 3D tensor into a 1D vector.

3.5 Activation Functions

Determine how neurons respond to inputs, introducing non-linearity into the network. A neuron computes:

$$y = \varphi(w^\top x + b)$$

where $\varphi$ is the activation function.

A naive approach is to set an activation threshold, but this is not differentiable, so we cannot do backpropagation.

A linear activation function has a constant gradient, so the gradient carries no relationship to the input during backpropagation. Therefore, we need a non-linear activation function, such as:

  - Sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$
  - Tanh: $\tanh(x)$
  - ReLU: $\max(0, x)$
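A few common activation functions, sketched in NumPy (leaky ReLU is included as one standard variant):

```python
import numpy as np

def sigmoid(x):
    """Squashes inputs to (0, 1); its gradient vanishes for large |x|."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """max(0, x): cheap to compute and non-saturating for x > 0."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU but with a small slope for x < 0, avoiding dead neurons."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 3.0])
```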

3.6 Loss Functions

Quantifies how well the network is doing. Some include:

  - Mean squared error, typically for regression.
  - Cross-entropy, typically for classification.

The choice of a loss function depends on the desired output.
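Both losses are straightforward in NumPy (a minimal sketch; `eps` guards against `log(0)`):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, typical for regression."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy for one-hot targets and predicted class probabilities,
    typical for classification."""
    return -np.mean(np.sum(y_true * np.log(p_pred + eps), axis=1))

y = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot labels for two samples
p = np.array([[0.9, 0.1], [0.2, 0.8]])   # predicted probabilities
loss = cross_entropy(y, p)
```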

3.7 Optimisation

Dense layers are computationally and memory expensive. CNNs reduce this by using parameter sharing and sparse interactions. $1 \times 1$ convolutions act like a multi-layer perceptron applied at each pixel.

3.8 Batch Normalization

Loss is calculated at the last layer, meaning the last layer learns the quickest. Data input is at the first layer, so when the first layer changes, the last layer needs to relearn many times, causing slow convergence. This is called internal covariate shift.

To avoid changing the last layers while learning the first layers, we can fix the mean and variance:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Then adjust it separately:

$$y = \gamma \hat{x} + \beta$$

Where $B$ is the minibatch, and $\gamma$ and $\beta$ are learned parameters. This speeds up training and acts as a regularizer.

To also reduce covariate shift, noise is injected into the inputs during training: the minibatch estimates of the mean and variance are themselves noisy, which perturbs each activation. This also removes the need for dropout. The ideal minibatch size is 64-256.

In a dense layer, there is one normalization for all channels; in a convolutional layer, one normalization per channel. A new mean and variance are computed for each minibatch.
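The dense-layer case (one normalization per feature across the batch) can be sketched as follows; the input shape and parameter values are illustrative:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a minibatch x of shape (batch, features):
    normalize with the batch statistics, then rescale with learned gamma, beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # batch of 64, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma = 1, beta = 0, each feature of y has ~zero mean and unit variance.
```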

3.9 Residual Networks

A ResNet ensures the function space of each layer includes that of the previous layer. This avoids vanishing gradient problems in deep networks. A residual block has a skip connection that outputs $f(x) + x$, where $f(x)$ is the output of the layer.
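A minimal sketch of a residual block (here a two-layer MLP body with a ReLU after the addition; the weight shapes are illustrative). The key point: if the body's weights are zero, the block reduces to the identity (up to the final ReLU), so each layer's function space includes the previous layer's.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Skip connection: output relu(f(x) + x), where f is a small two-layer body.
    With W1 = W2 = 0, f(x) = 0 and the block passes relu(x) through unchanged."""
    f_x = relu(x @ W1) @ W2
    return relu(f_x + x)

x = np.array([[1.0, -2.0, 3.0]])
zero = np.zeros((3, 3))
identity_out = residual_block(x, zero, zero)   # body contributes nothing
```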

Back to Home