CNNs
CNNs introduce hierarchy and locality into neural networks.
1. Universal Approximator
Let $\sigma$ be a non-constant, bounded, continuous activation function, and let $f$ be any continuous function on a compact set $K \subset \mathbb{R}^d$.
If we allow enough hidden units, then for any $\varepsilon > 0$ there exist weights $v_i, w_i, b_i$ such that $\left| f(x) - \sum_{i=1}^{N} v_i\,\sigma(w_i^\top x + b_i) \right| < \varepsilon$ for all $x \in K$. A single hidden layer is therefore a universal approximator, though it may need very many units.
1.1 Curse of Dimensionality
As the number of features or dimensions grows, the amount of data we need to generalise accurately grows exponentially.
To approximate a Lipschitz continuous function $f : [0,1]^d \to \mathbb{R}$ to accuracy $\varepsilon$ requires $O(\varepsilon^{-d})$ training samples.
In the $d$-dimensional ball, consider peeling off a thin outer "rind" by shrinking the radius $r$: the volume is $V(r) = c_d\, r^d$, so $V'(r) = d\, c_d\, r^{d-1}$.
- Beginning with no shrinking ($r = 1$), and then decreasing the radius ($\dot r < 0$), we see the initial rate of growth of the rind volume is $d\, c_d$.
- Initially, the volume of the rind grows $d$ times faster than the rate at which the object is being shrunk.
- In higher dimensions, tiny changes in distance translate to large changes in volume.
In higher dimensions, most Euclidean distances between observations in a dataset are nearly the same, and close to the diameter of the region in which they are enclosed.
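This distance concentration is easy to observe empirically. The sketch below (my own illustration, not from the notes; the function name `distance_spread` is invented) measures the relative spread of pairwise distances between uniform random points, which shrinks as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n=100):
    """Relative spread (max - min) / mean of pairwise Euclidean distances
    between n points drawn uniformly from the unit hypercube in R^d."""
    points = rng.uniform(size=(n, d))
    # All pairwise difference vectors via broadcasting, then their norms.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = dists[np.triu_indices(n, k=1)]  # unique pairs only
    return (upper.max() - upper.min()) / upper.mean()

# In high dimensions the distances bunch together: the spread collapses.
assert distance_spread(2) > distance_spread(500)
```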
1.2 Invariance & Equivariance
Shift invariance describes a system's unchanging response to input shifts. For example, in image recognition, a shift-invariant system would recognize an object regardless of its position in the image. Formally, a function $f$ is shift invariant if $f(T_s x) = f(x)$ for every shift operator $T_s$.
Equivariance means that if the input changes in a certain way, the output changes in the same way. For example, in image processing, if an image is rotated, an equivariant system would produce an output that is also rotated in the same manner. A function $f$ is shift equivariant if $f(T_s x) = T_s f(x)$.
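The two properties can be demonstrated on toy 1-D signals, treating a "shift" as a circular roll of the array (my own illustration, not from the notes):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 0.0, 3.0])
shifted = np.roll(x, 2)  # the shift operator T_s

# Invariance: the sum of a signal ignores shifts entirely, f(T_s x) = f(x).
invariant = lambda s: s.sum()
assert invariant(x) == invariant(shifted)

# Equivariance: shifting the input of an elementwise map shifts its output
# by the same amount, f(T_s x) = T_s f(x).
equivariant = lambda s: 2 * s
assert np.array_equal(equivariant(shifted), np.roll(equivariant(x), 2))
```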
1.3 Inductive Bias
We can introduce two principles:
- Translation Invariance: a shift in the input should lead to a shift in the hidden representation.
- Locality: we should not have to look far away from a location to glean relevant information about that area.
These can be applied with a sliding window approach, using correlation $(I \star K)[i, j] = \sum_{a, b} I[i + a,\, j + b]\, K[a, b]$, where $I$ is the image patch and $K$ is the kernel/filter.
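A minimal NumPy sketch of this sliding-window correlation (the function name `correlate2d` and the edge-detection example are mine, not from the notes):

```python
import numpy as np

def correlate2d(image, kernel):
    """Valid cross-correlation: slide the kernel over the image and sum
    elementwise products, i.e. out[i, j] = sum_ab image[i+a, j+b] * kernel[a, b]."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + h, j:j + w] * kernel).sum()
    return out

# A simple horizontal-difference kernel responds where intensity
# changes from left to right (a vertical edge).
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])
edge = np.array([[-1., 1.]])
print(correlate2d(img, edge))  # peaks at the column where 0 jumps to 1
```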
1.4 Deformation Invariance
We may also want invariance to deformation. Here we would have a warp operator $\mathcal{D}_\tau$ that applies a smooth deformation field $\tau$ to the input, and we ask that $f(\mathcal{D}_\tau x) \approx f(x)$ when the deformation is small.
2. Convolutions
- Convolution: $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Correlation: $(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t + \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Discrete Convolution: given arrays $a$ and $b$, their convolution is $(a * b)[n] = \sum_m a[m]\, b[n - m]$. When $a[m]$ or $b[n - m]$ are undefined they are assumed to be zero.
A convolution is commutative, associative, associative with scalar multiplication, and distributive over addition.
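The discrete definition can be implemented directly and checked against its algebraic properties (a sketch of mine; `conv1d` is not a standard name):

```python
import numpy as np

def conv1d(a, b):
    """Discrete convolution (a * b)[n] = sum_m a[m] * b[n - m], treating
    out-of-range entries as zero (a "full" convolution)."""
    n_out = len(a) + len(b) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(a)):
            if 0 <= n - m < len(b):
                out[n] += a[m] * b[n - m]
    return out

a = np.array([1., 2., 3.])
b = np.array([0., 1., 0.5])
# Commutativity: a * b == b * a.
assert np.allclose(conv1d(a, b), conv1d(b, a))
# Associativity with scalar multiplication: (ca) * b == c (a * b).
assert np.allclose(conv1d(2 * a, b), 2 * conv1d(a, b))
# Matches NumPy's built-in full convolution.
assert np.allclose(conv1d(a, b), np.convolve(a, b))
```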
3. CNNs
In conventional NNs, the input tensor is flattened into a vector, but this loses locality & translation invariance principles.
3.1 Convolutional Layer
In a CNN, we keep locality by connecting each output unit only to a small local patch of the input (its receptive field), and we keep translation invariance by sharing the same kernel weights across every spatial location.
Each kernel produces its own activation map, and these are stacked to produce the output volume. If there are $K$ kernels, the output volume has $K$ channels.
A CNN is a sequence of convolutional layers interleaved with activation functions.
Each filter has $F \times F \times C_{\text{in}} + 1$ parameters: one weight per entry of its input patch, plus a bias.
A layer with $K$ filters therefore has $K\,(F^2 C_{\text{in}} + 1)$ learnable parameters in total.
3.1.1 Stride & Padding
In a convolution:
- Stride is the number of pixels by which we slide the filter across the input image.
- Padding is the number of pixels added to the border of the input image. If you need to keep the height and width dimensions, then you can zero-pad the image.
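Under the usual convention, the spatial output size of a convolution with input width $W$, filter size $F$, padding $P$ and stride $S$ is $\lfloor (W - F + 2P)/S \rfloor + 1$. A small helper makes this concrete (the function name is mine):

```python
def conv_output_size(w, f, p=0, s=1):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1,
    for input size W, filter size F, padding P, stride S."""
    return (w - f + 2 * p) // s + 1

# "Same" padding with a 3x3 filter and stride 1 keeps a 32-wide input at 32.
assert conv_output_size(32, 3, p=1, s=1) == 32
# Stride 2 roughly halves the spatial size.
assert conv_output_size(32, 3, p=1, s=2) == 16
# No padding shrinks the output: a 3x3 filter on a 7-wide input gives 5.
assert conv_output_size(7, 3) == 5
```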
3.1.2 Computational Complexity
As each convolution decreases the spatial size of its output, stacking convolutions grows the effective receptive field. This can be exploited by using multiple small filters to approximate a single large filter: two stacked $3 \times 3$ convolutions cover the same $5 \times 5$ receptive field as one $5 \times 5$ convolution, but with fewer parameters.
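The parameter saving is simple arithmetic (a sketch of mine, ignoring biases and assuming the channel count $C$ stays constant across layers):

```python
def conv_params(f, c_in, c_out):
    """Weight count of a conv layer with c_out filters of size f x f x c_in
    (biases ignored for simplicity)."""
    return f * f * c_in * c_out

C = 64
# Two stacked 3x3 convolutions see a 5x5 receptive field with 2 * 9 * C^2
# weights, versus 25 * C^2 for a single 5x5 convolution.
stacked = 2 * conv_params(3, C, C)
single = conv_params(5, C, C)
assert stacked < single
print(stacked, single)  # 2*9*C^2 vs 25*C^2
```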
3.2 Pooling Layer
Pooling aggregates and downsamples the input tensor. The order of pixels within a patch does not matter, only their values, which introduces translational invariance. Some pooling methods are:
- Max Pooling: takes the maximum value in each patch. Max pooling can break shift equivariance. This can be partially solved by anti-aliasing (blurring) before downsampling.
- Average Pooling: takes the average value in each patch.
- etc.
This is applied to each channel separately.
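Both variants can be sketched in a few lines of NumPy for a single channel (my own illustration; `pool2d` is not a library function):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling (stride == window size) over one channel."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 0., 5., 6.],
              [0., 0., 7., 8.]])
print(pool2d(x))               # max pooling: keeps the strongest response
print(pool2d(x, mode="avg"))   # average pooling: smooths each patch
```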
3.3 Fully Connected Layer
A standard dense layer, where each neuron is connected to every neuron in the previous layer.
3.4 Flatten Layer
These connect the convolutional layers to the fully connected layers by flattening the 3D tensor into a 1D vector.
3.5 Activation Functions
Determine how neurons respond to inputs, introducing non-linearity into the network. A neuron computes $y = \sigma(w^\top x + b)$, where $\sigma$ is the activation function.
A naive approach is to set an activation threshold, but this is not differentiable, so we cannot do backpropagation.
A linear activation function is differentiable, but any composition of linear layers collapses into a single linear map, so the network could not represent non-linear functions.
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$ is non-linear. It is steepest at small $|x|$, meaning smaller inputs have a larger effect. However, it suffers from vanishing gradients for large $|x|$.
- Tanh: $\tanh(x)$ is a scaled sigmoid, outputting values in $(-1, 1)$. It is zero-centered, but still suffers from vanishing gradients.
- ReLU: $\max(0, x)$ is efficient, and combinations of ReLUs are non-linear. However, it suffers from the dying ReLU problem, where neurons can get stuck outputting 0.
- Leaky ReLU: $\max(\alpha x, x)$ for small $\alpha$ (e.g. 0.01) mitigates the dying ReLU problem.
- PReLU: the same as Leaky ReLU, but $\alpha$ is learned during training.
- Softplus: $\log(1 + e^x)$ is a smooth approximation to ReLU. Output is always positive; for large $x$, $\log(1 + e^x) \approx x$.
- LogSigmoid: $\log \frac{1}{1 + e^{-x}}$ is the log of the sigmoid function. It is numerically stable for large negative $x$.
- Softmin: $\mathrm{softmin}(x)_i = \frac{e^{-x_i}}{\sum_j e^{-x_j}}$ operates on vectors, producing a probability distribution that favours small entries.
- Softmax: $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$ operates on vectors, producing a probability distribution.
- LogSoftmax: the logarithm of the softmax function, useful for numerical stability in classification tasks.
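The numerical-stability point is worth seeing concretely. A standard trick (sketched below in NumPy, not from the notes) is to subtract the maximum logit before exponentiating, which leaves softmax unchanged but prevents overflow:

```python
import numpy as np

def softmax(x):
    """Stable softmax: subtracting max(x) leaves the result unchanged
    (it cancels in the ratio) but avoids overflow in exp for large logits."""
    z = np.exp(x - x.max())
    return z / z.sum()

def log_softmax(x):
    """log(softmax(x)) computed without exponentiating large values."""
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp(x) would overflow
print(softmax(logits))  # still a valid probability distribution
```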
3.6 Loss Functions
Quantifies how well the network is doing. Some include:
- L2 Norm (MSE): $\ell = \sum_i (y_i - \hat{y}_i)^2$. Can be reduced either with mean or sum. Penalizes large errors more than small ones.
- L1 Norm: $\ell = \sum_i |y_i - \hat{y}_i|$. Can be reduced either with mean or sum. More robust to outliers than L2.
- Smooth L1 Loss: $\ell_i = 0.5\, z_i^2$ if $|z_i| < 1$, else $|z_i| - 0.5$, where $z_i = y_i - \hat{y}_i$. Combines L1 and L2.
- Negative Log Likelihood: assume the network's output represents log likelihoods of each class. Then, $\ell_n = -w_{y_n}\, x_{n, y_n}$, where $x_{n,c}$ is the log likelihood of class $c$ for sample $n$ and $w_c$ is the weight for class $c$.
- Cross Entropy Loss combines LogSoftmax and NLLLoss as $\ell_n = -\log \frac{\exp(x_{n, y_n})}{\sum_c \exp(x_{n, c})}$. This is more numerically stable.
- Binary Cross Entropy Loss is CE loss for 2 classes, where $\ell = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$. Can be reduced with mean or sum. Requires $\hat{y} \in [0, 1]$.
- Margin Ranking Loss / Ranking Loss / Contrastive Loss: predicts relative distance between pairs of inputs rather than absolute class labels. $\ell = \max(0,\, -y\,(x_1 - x_2) + \text{margin})$, where $y \in \{1, -1\}$ indicates if $x_1$ should be ranked higher than $x_2$.
- Triplet Margin Loss makes samples from the same classes close and different classes far away. Used for metric learning and Siamese networks: $\ell = \max(d(a, p) - d(a, n) + \text{margin},\, 0)$, where $d(x, y) = \lVert x - y \rVert_2$, $a$ is the anchor, $p$ is a positive sample and $n$ is a negative sample.
The choice of a loss function depends on the desired output.
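As a small illustration of the LogSoftmax + NLL composition (my own sketch for a single unweighted sample, not from the notes):

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy as LogSoftmax followed by NLL: compute stable
    log-probabilities, then negate the one belonging to the target class."""
    shifted = logits - logits.max()          # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum())  # LogSoftmax
    return -log_probs[target]                # NLL for this sample

logits = np.array([2.0, 1.0, 0.1])
# The loss is smaller when the target is the highest-scoring class.
assert cross_entropy(logits, 0) < cross_entropy(logits, 2)
```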
3.7 Optimisation
Dense layers are computationally and memory expensive. CNNs reduce this by using parameter sharing and sparse interactions. $1 \times 1$ convolutions act like a multi-layer perceptron applied per pixel.
3.8 Batch Normalization
Loss is calculated at the last layer, meaning the last layer learns the quickest. Data input is at the first layer, so if the first layer changes, the last layer needs to relearn many times, causing slow convergence. This is called internal covariate shift.
To avoid changing the last layers while learning the first layers, we can fix the mean and variance: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, where $\mu_B$ and $\sigma_B^2$ are the minibatch mean and variance.
Then adjust it separately: $y = \gamma \hat{x} + \beta$,
where $\gamma$ (scale) and $\beta$ (shift) are learned parameters.
Because $\mu_B$ and $\sigma_B^2$ are recomputed on every minibatch, normalization also acts as noise injection on the inputs during training, which further reduces covariate shift.
In a dense layer, normalization is applied per feature; in a convolutional layer, per channel (over the batch and spatial dimensions). A new mean and variance are computed for each minibatch.
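A training-mode forward pass for the dense case can be sketched in NumPy, normalizing each feature over the minibatch (my own illustration; running statistics for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a dense layer: normalize each feature
    column over the minibatch, then rescale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                      # per-feature minibatch mean
    var = x.var(axis=0)                      # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # fix mean 0, variance 1
    return gamma * x_hat + beta              # learned scale and shift

x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# Each feature column now has (approximately) zero mean and unit variance.
print(out.mean(axis=0), out.var(axis=0))
```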
3.9 Residual Networks
A ResNet ensures the function space of each layer includes the previous layer's. This avoids gradient vanishing problems in deep networks. A residual block has a skip connection that outputs $y = f(x) + x$, so the block only needs to learn the residual $f(x) = y - x$; if the identity mapping is optimal, the weights can simply be driven towards zero.
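A minimal sketch of a residual block (my own, using plain linear layers in place of convolutions to keep it short):

```python
import numpy as np

def residual_block(x, w1, w2):
    """Minimal residual block: two linear transforms with a ReLU in
    between, plus the skip connection y = f(x) + x."""
    h = np.maximum(0, x @ w1)   # first layer + ReLU
    fx = h @ w2                 # second layer: the residual branch f(x)
    return fx + x               # skip connection adds the input back

# With zero weights the residual branch vanishes and the block is exactly
# the identity, so stacking many such blocks cannot degrade the signal.
x = np.array([1.0, -2.0, 3.0])
w_zero = np.zeros((3, 3))
assert np.array_equal(residual_block(x, w_zero, w_zero), x)
```

This is the sense in which each layer's function space includes the previous layer's: the block can always fall back to doing nothing.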