CNNs
CNNs introduce hierarchy and locality into neural networks.
1. Universal Approximator
Let $\sigma$ be a non-constant, bounded, continuous activation function, and let $f$ be any continuous function on a compact set $K \subset \mathbb{R}^d$.
If we allow enough hidden units, then for any $\varepsilon > 0$ there exist weights $v_i, w_i, b_i$ such that $\left| f(x) - \sum_{i=1}^{N} v_i\,\sigma(w_i^\top x + b_i) \right| < \varepsilon$ for all $x \in K$. A single hidden layer is therefore a universal approximator, though it may need very many units.
1.1 Curse of Dimensionality
As the number of features or dimensions grows, the amount of data we need to generalise accurately grows exponentially.
To approximate a Lipschitz continuous function $f : [0,1]^d \to \mathbb{R}$ to accuracy $\varepsilon$ requires $O(\varepsilon^{-d})$ training samples.
In the $d$-dimensional ball, consider peeling off a thin outer "rind" by shrinking the radius $r$: the volume is $V(r) = c_d\, r^d$, so $V'(r) = d\, c_d\, r^{d-1}$.
- Beginning with no shrinking ($r = 1$), and then decreasing the radius ($\dot r < 0$), we see the initial rate of growth of the rind volume is $d\, c_d$.
- Initially, the volume of the rind grows $d$ times faster than the rate at which the object is being shrunk.
- In higher dimensions, tiny changes in distance translate to large changes in volume.
In higher dimensions, most Euclidean distances between observations in a dataset are nearly the same, and close to the diameter of the region in which they are enclosed.
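This distance concentration is easy to observe empirically. The sketch below (my own illustration, not from the notes; the function name `distance_spread` is invented) measures the relative spread of pairwise distances between uniform random points, which shrinks as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n=100):
    """Relative spread (max - min) / mean of pairwise Euclidean distances
    between n points drawn uniformly from the unit hypercube in R^d."""
    points = rng.uniform(size=(n, d))
    # All pairwise difference vectors via broadcasting, then their norms.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = dists[np.triu_indices(n, k=1)]  # unique pairs only
    return (upper.max() - upper.min()) / upper.mean()

# In high dimensions the distances bunch together: the spread collapses.
assert distance_spread(2) > distance_spread(500)
```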
1.2 Invariance & Equivariance
Shift invariance describes a system's unchanging response to input shifts. For example, in image recognition, a shift-invariant system would recognize an object regardless of its position in the image. Formally, a function $f$ is shift invariant if $f(T_s x) = f(x)$ for every shift operator $T_s$.
Equivariance means that if the input changes in a certain way, the output changes in the same way. For example, in image processing, if an image is rotated, an equivariant system would produce an output that is also rotated in the same manner. A function $f$ is shift equivariant if $f(T_s x) = T_s f(x)$.
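The two properties can be demonstrated on toy 1-D signals, treating a "shift" as a circular roll of the array (my own illustration, not from the notes):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 0.0, 3.0])
shifted = np.roll(x, 2)  # the shift operator T_s

# Invariance: the sum of a signal ignores shifts entirely, f(T_s x) = f(x).
invariant = lambda s: s.sum()
assert invariant(x) == invariant(shifted)

# Equivariance: shifting the input of an elementwise map shifts its output
# by the same amount, f(T_s x) = T_s f(x).
equivariant = lambda s: 2 * s
assert np.array_equal(equivariant(shifted), np.roll(equivariant(x), 2))
```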
1.3 Inductive Bias
We can introduce two principles:
- Translation Invariance: a shift in the input should lead to a shift in the hidden representation.
- Locality: we should not have to look far away from a location to glean relevant information about that area.
These can be applied with a sliding window approach, using correlation $(I \star K)[i, j] = \sum_{a, b} I[i + a,\, j + b]\, K[a, b]$, where $I$ is the image patch and $K$ is the kernel/filter.
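A minimal NumPy sketch of this sliding-window correlation (the function name `correlate2d` and the edge-detection example are mine, not from the notes):

```python
import numpy as np

def correlate2d(image, kernel):
    """Valid cross-correlation: slide the kernel over the image and sum
    elementwise products, i.e. out[i, j] = sum_ab image[i+a, j+b] * kernel[a, b]."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + h, j:j + w] * kernel).sum()
    return out

# A simple horizontal-difference kernel responds where intensity
# changes from left to right (a vertical edge).
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])
edge = np.array([[-1., 1.]])
print(correlate2d(img, edge))  # peaks at the column where 0 jumps to 1
```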
1.4 Deformation Invariance
We may also want invariance to deformation. Here we would have a warp operator $\mathcal{D}_\tau$ that applies a smooth deformation field $\tau$ to the input, and we ask that $f(\mathcal{D}_\tau x) \approx f(x)$ when the deformation is small.
2. Convolutions
- Convolution: $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Correlation: $(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t + \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Discrete Convolution: given arrays $a$ and $b$, their convolution is $(a * b)[n] = \sum_m a[m]\, b[n - m]$. When $a[m]$ or $b[n - m]$ are undefined they are assumed to be zero.
A convolution is commutative, associative, associative with scalar multiplication, and distributive over addition.
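The discrete definition can be implemented directly and checked against its algebraic properties (a sketch of mine; `conv1d` is not a standard name):

```python
import numpy as np

def conv1d(a, b):
    """Discrete convolution (a * b)[n] = sum_m a[m] * b[n - m], treating
    out-of-range entries as zero (a "full" convolution)."""
    n_out = len(a) + len(b) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(a)):
            if 0 <= n - m < len(b):
                out[n] += a[m] * b[n - m]
    return out

a = np.array([1., 2., 3.])
b = np.array([0., 1., 0.5])
# Commutativity: a * b == b * a.
assert np.allclose(conv1d(a, b), conv1d(b, a))
# Associativity with scalar multiplication: (ca) * b == c (a * b).
assert np.allclose(conv1d(2 * a, b), 2 * conv1d(a, b))
# Matches NumPy's built-in full convolution.
assert np.allclose(conv1d(a, b), np.convolve(a, b))
```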
3. CNNs
In conventional NNs, the input tensor is flattened into a vector, but this loses locality & translation invariance principles.
3.1 Convolutional Layer
In a CNN, we keep locality by connecting each output unit only to a small local patch of the input (its receptive field), and we keep translation invariance by sharing the same kernel weights across every spatial location.
Each kernel produces its own activation map, and these are stacked to produce the output volume. If there are $K$ kernels, the output volume has $K$ channels.
A CNN is a sequence of convolutional layers interleaved with activation functions.
Each filter has $F \times F \times C_{\text{in}} + 1$ parameters: one weight per entry of its input patch, plus a bias.
A layer with $K$ filters therefore has $K\,(F^2 C_{\text{in}} + 1)$ learnable parameters in total.
3.1.1 Stride & Padding
In a convolution:
- Stride is the number of pixels by which we slide the filter across the input image.
- Padding is the number of pixels added to the border of the input image. If you need to keep the height and width dimensions, then you can zero-pad the image.
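Under the usual convention, the spatial output size of a convolution with input width $W$, filter size $F$, padding $P$ and stride $S$ is $\lfloor (W - F + 2P)/S \rfloor + 1$. A small helper makes this concrete (the function name is mine):

```python
def conv_output_size(w, f, p=0, s=1):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1,
    for input size W, filter size F, padding P, stride S."""
    return (w - f + 2 * p) // s + 1

# "Same" padding with a 3x3 filter and stride 1 keeps a 32-wide input at 32.
assert conv_output_size(32, 3, p=1, s=1) == 32
# Stride 2 roughly halves the spatial size.
assert conv_output_size(32, 3, p=1, s=2) == 16
# No padding shrinks the output: a 3x3 filter on a 7-wide input gives 5.
assert conv_output_size(7, 3) == 5
```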
3.1.2 Computational Complexity
As each convolution decreases the spatial size of its output, stacking convolutions grows the effective receptive field. This can be exploited by using multiple small filters to approximate a single large filter: two stacked $3 \times 3$ convolutions cover the same $5 \times 5$ receptive field as one $5 \times 5$ convolution, but with fewer parameters.
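The parameter saving is simple arithmetic (a sketch of mine, ignoring biases and assuming the channel count $C$ stays constant across layers):

```python
def conv_params(f, c_in, c_out):
    """Weight count of a conv layer with c_out filters of size f x f x c_in
    (biases ignored for simplicity)."""
    return f * f * c_in * c_out

C = 64
# Two stacked 3x3 convolutions see a 5x5 receptive field with 2 * 9 * C^2
# weights, versus 25 * C^2 for a single 5x5 convolution.
stacked = 2 * conv_params(3, C, C)
single = conv_params(5, C, C)
assert stacked < single
print(stacked, single)  # 2*9*C^2 vs 25*C^2
```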
3.2 Pooling Layer
Pooling aggregates and downsamples the input tensor. The order of pixels within a patch does not matter, only their values, which introduces translational invariance. Some pooling methods are:
- Max Pooling: takes the maximum value in each patch. Max pooling can break shift equivariance. This can be partially solved by anti-aliasing (blurring) before downsampling.
- Average Pooling: takes the average value in each patch.
- etc.
This is applied to each channel separately.
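Both variants can be sketched in a few lines of NumPy for a single channel (my own illustration; `pool2d` is not a library function):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling (stride == window size) over one channel."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 0., 5., 6.],
              [0., 0., 7., 8.]])
print(pool2d(x))               # max pooling: keeps the strongest response
print(pool2d(x, mode="avg"))   # average pooling: smooths each patch
```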
3.3 Fully Connected Layer
A standard dense layer, where each neuron is connected to every neuron in the previous layer.
3.4 Flatten Layer
These connect the convolutional layers to the fully connected layers by flattening the 3D tensor into a 1D vector.
3.5 Activation Functions
Determine how neurons respond to inputs, introducing non-linearity into the network. A neuron computes $y = \sigma(w^\top x + b)$, where $\sigma$ is the activation function.
A naive approach is to set an activation threshold, but this is not differentiable, so we cannot do backpropagation.
A linear activation function is differentiable, but any composition of linear layers collapses into a single linear map, so the network could not represent non-linear functions.
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$ is non-linear. It is steepest at small $|x|$, meaning smaller inputs have a larger effect. However, it suffers from vanishing gradients for large $|x|$.
- Tanh: $\tanh(x)$ is a scaled sigmoid, outputting values in $(-1, 1)$. It is zero-centered, but still suffers from vanishing gradients.
- ReLU: $\max(0, x)$ is efficient, and combinations of ReLUs are non-linear. However, it suffers from the dying ReLU problem, where neurons can get stuck outputting 0.
- Leaky ReLU: $\max(\alpha x, x)$ for small $\alpha$ (e.g. 0.01) mitigates the dying ReLU problem.
- PReLU: the same as Leaky ReLU, but $\alpha$ is learned during training.
- Softplus: $\log(1 + e^x)$ is a smooth approximation to ReLU. Output is always positive; for large $x$, $\log(1 + e^x) \approx x$.
- LogSigmoid: $\log \frac{1}{1 + e^{-x}}$ is the log of the sigmoid function. It is numerically stable for large negative $x$.
- Softmin: $\mathrm{softmin}(x)_i = \frac{e^{-x_i}}{\sum_j e^{-x_j}}$ operates on vectors, producing a probability distribution that favours small entries.
- Softmax: $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$ operates on vectors, producing a probability distribution.
- LogSoftmax: the logarithm of the softmax function, useful for numerical stability in classification tasks.
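The numerical-stability point is worth seeing concretely. A standard trick (sketched below in NumPy, not from the notes) is to subtract the maximum logit before exponentiating, which leaves softmax unchanged but prevents overflow:

```python
import numpy as np

def softmax(x):
    """Stable softmax: subtracting max(x) leaves the result unchanged
    (it cancels in the ratio) but avoids overflow in exp for large logits."""
    z = np.exp(x - x.max())
    return z / z.sum()

def log_softmax(x):
    """log(softmax(x)) computed without exponentiating large values."""
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp(x) would overflow
print(softmax(logits))  # still a valid probability distribution
```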
3.6 Loss Functions
Quantifies how well the network is doing. Some include:
- L2 Norm (MSE): $\ell = \sum_i (y_i - \hat{y}_i)^2$. Can be reduced either with mean or sum. Penalizes large errors more than small ones.
- L1 Norm: $\ell = \sum_i |y_i - \hat{y}_i|$. Can be reduced either with mean or sum. More robust to outliers than L2.
- Smooth L1 Loss: $\ell_i = 0.5\, z_i^2$ if $|z_i| < 1$, else $|z_i| - 0.5$, where $z_i = y_i - \hat{y}_i$. Combines L1 and L2.
- Negative Log Likelihood: assume the network's output represents log likelihoods of each class. Then, $\ell_n = -w_{y_n}\, x_{n, y_n}$, where $x_{n,c}$ is the log likelihood of class $c$ for sample $n$ and $w_c$ is the weight for class $c$.
- Cross Entropy Loss combines LogSoftmax and NLLLoss as $\ell_n = -\log \frac{\exp(x_{n, y_n})}{\sum_c \exp(x_{n, c})}$. This is more numerically stable.
- Binary Cross Entropy Loss is CE loss for 2 classes, where $\ell = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$. Can be reduced with mean or sum. Requires $\hat{y} \in [0, 1]$.
- Margin Ranking Loss / Ranking Loss / Contrastive Loss: predicts relative distance between pairs of inputs rather than absolute class labels. $\ell = \max(0,\, -y\,(x_1 - x_2) + \text{margin})$, where $y \in \{1, -1\}$ indicates if $x_1$ should be ranked higher than $x_2$.
- Triplet Margin Loss makes samples from the same classes close and different classes far away. Used for metric learning and Siamese networks: $\ell = \max(d(a, p) - d(a, n) + \text{margin},\, 0)$, where $d(x, y) = \lVert x - y \rVert_2$, $a$ is the anchor, $p$ is a positive sample and $n$ is a negative sample.
The choice of a loss function depends on the desired output.
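As a small illustration of the LogSoftmax + NLL composition (my own sketch for a single unweighted sample, not from the notes):

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy as LogSoftmax followed by NLL: compute stable
    log-probabilities, then negate the one belonging to the target class."""
    shifted = logits - logits.max()          # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum())  # LogSoftmax
    return -log_probs[target]                # NLL for this sample

logits = np.array([2.0, 1.0, 0.1])
# The loss is smaller when the target is the highest-scoring class.
assert cross_entropy(logits, 0) < cross_entropy(logits, 2)
```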
3.7 Optimisation
Dense layers are computationally and memory expensive. CNNs reduce this by using parameter sharing and sparse interactions. $1 \times 1$ convolutions act like a multi-layer perceptron applied per pixel.
3.8 Batch Normalization
Loss is calculated at the last layer, meaning the last layer learns the quickest. Data input is at the first layer, so if the first layer changes, the last layer needs to relearn many times, causing slow convergence. This is called internal covariate shift.
To avoid changing the last layers while learning the first layers, we can fix the mean and variance: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, where $\mu_B$ and $\sigma_B^2$ are the minibatch mean and variance.
Then adjust it separately: $y = \gamma \hat{x} + \beta$,
where $\gamma$ (scale) and $\beta$ (shift) are learned parameters.
Because $\mu_B$ and $\sigma_B^2$ are recomputed on every minibatch, normalization also acts as noise injection on the inputs during training, which further reduces covariate shift.
In a dense layer, normalization is applied per feature; in a convolutional layer, per channel (over the batch and spatial dimensions). A new mean and variance are computed for each minibatch.
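A training-mode forward pass for the dense case can be sketched in NumPy, normalizing each feature over the minibatch (my own illustration; running statistics for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a dense layer: normalize each feature
    column over the minibatch, then rescale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                      # per-feature minibatch mean
    var = x.var(axis=0)                      # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # fix mean 0, variance 1
    return gamma * x_hat + beta              # learned scale and shift

x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# Each feature column now has (approximately) zero mean and unit variance.
print(out.mean(axis=0), out.var(axis=0))
```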
3.9 Residual Networks
A ResNet ensures the function space of each layer includes the previous layer's. This avoids gradient vanishing problems in deep networks. A residual block has a skip connection that outputs $y = f(x) + x$, so the block only needs to learn the residual $f(x) = y - x$; if the identity mapping is optimal, the weights can simply be driven towards zero.
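A minimal sketch of a residual block (my own, using plain linear layers in place of convolutions to keep it short):

```python
import numpy as np

def residual_block(x, w1, w2):
    """Minimal residual block: two linear transforms with a ReLU in
    between, plus the skip connection y = f(x) + x."""
    h = np.maximum(0, x @ w1)   # first layer + ReLU
    fx = h @ w2                 # second layer: the residual branch f(x)
    return fx + x               # skip connection adds the input back

# With zero weights the residual branch vanishes and the block is exactly
# the identity, so stacking many such blocks cannot degrade the signal.
x = np.array([1.0, -2.0, 3.0])
w_zero = np.zeros((3, 3))
assert np.array_equal(residual_block(x, w_zero, w_zero), x)
```

This is the sense in which each layer's function space includes the previous layer's: the block can always fall back to doing nothing.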