CNNs
CNNs introduce hierarchy and locality into neural networks.
1. Universal Approximator
Let $\sigma$ be a non-constant, bounded, continuous activation function, and let $f$ be any continuous function on a compact set $K \subset \mathbb{R}^d$.
If we allow enough hidden units, then for any $\epsilon > 0$ there exist weights such that the one-hidden-layer network $g(x) = \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^\top x + b_i)$ satisfies $|f(x) - g(x)| < \epsilon$ for all $x \in K$.
1.1 Curse of Dimensionality
As the number of features or dimensions grows, the amount of data we need to generalise accurately grows exponentially.
To approximate a Lipschitz (continuous) function $f : [0,1]^d \to \mathbb{R}$ to accuracy $\epsilon$ requires $\mathcal{O}(\epsilon^{-d})$ training samples.
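As a rough numeric illustration (a sketch that reads the bound literally as $\epsilon^{-d}$ and ignores constants):

```python
# Rough illustration of the O(eps^{-d}) sample-complexity bound,
# taking the bound literally as eps**(-d) and ignoring constants.
eps = 0.1  # target accuracy

for d in (1, 2, 5, 10, 20):
    n = eps ** (-d)  # samples needed to cover [0,1]^d at resolution eps
    print(f"d={d:2d}: ~{n:.0e} samples")
# d=20 already needs ~1e20 samples -- the curse of dimensionality.
```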
In the $d$-dimensional "orange" example, take a ball of radius $r$ and shrink its radius by $\delta$; the rind is the shell left behind, with volume fraction $1 - \left(\frac{r - \delta}{r}\right)^d$.
- Beginning with no shrinking ($\delta = 0$), and a radius that is decreasing ($\delta$ growing), we see the initial rate of growth of the rind volume is $\frac{d}{r}$.
- Initially, the volume of the rind grows $d$ times faster than the rate at which the object is being shrunk.
- In higher dimensions, tiny changes in distance translate to large changes in volume.
In higher dimensions, most Euclidean distances between observations in a dataset are nearly the same and close to the diameter of the region in which they are enclosed.
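A minimal simulation sketch of this concentration effect, using uniformly sampled points and illustrative sample sizes:

```python
import numpy as np

# Sketch: sample points uniformly in [0,1]^d and check how pairwise
# Euclidean distances behave as d grows -- in high dimensions the
# relative spread (std/mean) shrinks, i.e. distances concentrate.
rng = np.random.default_rng(0)
n = 500

for d in (2, 10, 100, 1000):
    x = rng.random((n, d))                        # n points in [0,1]^d
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T  # squared distances (Gram trick)
    dists = np.sqrt(np.clip(d2, 0, None))[np.triu_indices(n, k=1)]
    print(f"d={d:4d}: mean={dists.mean():6.2f}, "
          f"std/mean={dists.std() / dists.mean():.3f}")
```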
1.2 Invariance & Equivariance
Shift invariance describes a system's unchanging response to input shifts. For example, in image recognition, a shift-invariant system would recognize an object regardless of its position in the image. Assuming a function $f$ and a shift operator $\mathcal{T}_s$, shift invariance means $f(\mathcal{T}_s x) = f(x)$ for every shift $s$.
Equivariance means that if the input changes in a certain way, the output changes in the same way. For example, in image processing, if an image is rotated, an equivariant system would produce an output that is also rotated in the same manner. A function $f$ is shift equivariant if $f(\mathcal{T}_s x) = \mathcal{T}_s f(x)$ for every shift operator $\mathcal{T}_s$.
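A minimal 1D sketch of the distinction, using a circular cross-correlation so that shifts wrap around (the signal and kernel values are illustrative):

```python
import numpy as np

def shift(x, s):
    # Circular shift by s positions.
    return np.roll(x, s)

def circ_corr(x, w):
    # output[i] = sum_a w[a] * x[i + a], with indices taken mod len(x)
    return np.array([sum(w[a] * x[(i + a) % len(x)] for a in range(len(w)))
                     for i in range(len(x))])

x = np.array([0., 1., 3., 2., 0., 0., 1., 0.])
w = np.array([1., -1.])          # small edge-detecting kernel

# Equivariance: correlating a shifted input equals shifting the correlation.
lhs = circ_corr(shift(x, 2), w)
rhs = shift(circ_corr(x, w), 2)
print(np.allclose(lhs, rhs))     # True -> shift equivariant

# Invariance: a global max over the feature map ignores the shift entirely.
print(circ_corr(x, w).max() == circ_corr(shift(x, 2), w).max())  # True
```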
1.3 Inductive Bias
We can introduce two principles:
- Translation Invariance: a shift in the input should lead to a shift in the hidden representation.
- Locality: we should not have to look far away from a location to glean relevant information about that area.
These can be applied with a sliding window approach, using the correlation
$$h_{i,j} = \sum_{a}\sum_{b} w_{a,b}\, x_{i+a,\, j+b},$$
where $x$ is the image patch and $w$ is the kernel/filter.
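A minimal NumPy sketch of this sliding-window correlation (valid positions only, stride 1, single channel; names are illustrative):

```python
import numpy as np

def corr2d(image, kernel):
    # Slide the kernel over every valid position of the image and take
    # the elementwise product-and-sum at each location.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # local window at (i, j)
            out[i, j] = (patch * kernel).sum()  # h_ij = sum_ab w_ab x_{i+a,j+b}
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, -1.0]])                # detects horizontal edges
print(corr2d(image, kernel))                    # shape (4, 3)
```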
1.4 Deformation Invariance
We may also want invariance to deformation. Here we would have a warp operator $\mathcal{D}_\tau$ that distorts the input by a smooth displacement field $\tau$, and we would like $f(\mathcal{D}_\tau x) \approx f(x)$ when the deformation is small.
2. Convolutions
- Convolution: $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Correlation: $(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t + \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Discrete Convolution: given arrays $a$ and $b$, their convolution is $(a * b)_k = \sum_i a_i\, b_{k-i}$. When $a_i$ or $b_{k-i}$ are undefined they are assumed to be zero.
A convolution is commutative, associative, distributive over addition, and associative with scalar multiplication.
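A small sketch of the discrete case, with a check of commutativity and agreement with `np.convolve`:

```python
import numpy as np

def conv(a, b):
    # Discrete convolution: (a * b)_k = sum_i a_i b_{k-i}, treating
    # out-of-range entries as zero. Result length is len(a) + len(b) - 1.
    n = len(a) + len(b) - 1
    out = np.zeros(n)
    for k in range(n):
        for i in range(len(a)):
            if 0 <= k - i < len(b):
                out[k] += a[i] * b[k - i]
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 0.5])

print(conv(a, b))
print(np.allclose(conv(a, b), conv(b, a)))         # commutative
print(np.allclose(conv(a, b), np.convolve(a, b)))  # agrees with NumPy
```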
3. CNNs
In conventional NNs, the input tensor is flattened into a vector, but this discards the locality and translation-invariance principles.
3.1 Convolutional Layer
In a CNN, we keep locality as a built-in constraint: the input stays a 3D tensor (height × width × channels), and each output value depends only on a small local patch of the input, computed by sliding a kernel of shared weights across it.
Each kernel produces its own activation map, and these are stacked to produce the output volume. If there are $K$ kernels, the output volume has $K$ channels.
A CNN is a sequence of convolutional layers interleaved with activation functions.
Each filter has $k_h \times k_w \times C_{in}$ weights plus a bias, where $C_{in}$ is the number of input channels.
A layer with $K$ such filters therefore has $K\,(k_h\, k_w\, C_{in} + 1)$ parameters.
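A minimal PyTorch sketch of a single convolutional layer (the sizes are illustrative assumptions), showing the stacked output volume and the $K(k_h k_w C_{in} + 1)$ parameter count:

```python
import torch
import torch.nn as nn

# Sketch: one convolutional layer with C_in = 3 input channels and K = 16
# kernels of size 5x5. Each kernel produces one activation map; the maps
# are stacked into an output volume with 16 channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.randn(1, 3, 32, 32)          # batch of one 32x32 RGB image
y = conv(x)
print(y.shape)                          # torch.Size([1, 16, 28, 28])

# Parameter count: K * (k_h * k_w * C_in + 1) = 16 * (5*5*3 + 1) = 1216
print(sum(p.numel() for p in conv.parameters()))   # 1216
```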
3.1.1 Stride & Padding
In a convolution:
- Stride is the number of pixels by which we slide the filter across the input image.
- Padding is the number of pixels added to the border of the input image. If you need to keep the height and width dimensions then you can zero-pad the image.
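For reference (the standard output-size relation, with symbols chosen here for illustration): with input size $n$, kernel size $k$, padding $p$ and stride $s$, each spatial dimension of the output is
$$o = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1,$$
so stride 1 with $p = \lfloor k/2 \rfloor$ (odd $k$) preserves the height and width.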
3.1.2 Computational Complexity
As each convolution decreases the spatial dimensions of its output, stacked layers see ever larger regions of the original input (their receptive field grows), while the cost of a single layer scales with its kernel area. This can be exploited by using multiple small filters to approximate a single large filter.
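A hedged comparison sketch (the channel count is an illustrative assumption): two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer parameters and an extra nonlinearity.

```python
import torch.nn as nn

# Compare one 5x5 convolution with two stacked 3x3 convolutions.
# Both see a 5x5 receptive field, but the stacked version has fewer
# weights (2 * 3*3*C*C vs 5*5*C*C, ignoring biases).
C = 64

big = nn.Conv2d(C, C, kernel_size=5, padding=2)
small = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(big))    # 5*5*64*64 + 64      = 102464
print(count(small))  # 2*(3*3*64*64 + 64)  =  73856
```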
3.2 Pooling Layer
Pooling aggregates and downsamples the input tensor. The order of pixels within a patch does not matter, only their values. This introduces translational invariance. Some pooling methods are:
- Max Pooling: takes the maximum value in each patch. Max pooling can break shift equivariance. This can be partially solved by anti-aliasing (blurring) before downsampling.
- Average Pooling: takes the average value in each patch.
- etc.
This is applied to each channel separately.
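A minimal PyTorch sketch of 2×2 max pooling with stride 2 (the input values are illustrative):

```python
import torch
import torch.nn as nn

# 2x2 max pooling with stride 2, applied to each channel separately.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.tensor([[[[1., 2., 0., 1.],
                    [3., 4., 1., 0.],
                    [0., 0., 2., 2.],
                    [1., 1., 2., 5.]]]])   # shape (1, 1, 4, 4)

print(pool(x))
# tensor([[[[4., 1.],
#           [1., 5.]]]])  -- each 2x2 patch is replaced by its maximum
```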
3.3 Fully Connected Layer
A standard dense layer, where each neuron is connected to every neuron in the previous layer.
3.4 Flatten Layer
These connect the convolutional layers to the fully connected layers by flattening the 3D tensor into a 1D vector.
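A minimal end-to-end sketch tying these layers together (PyTorch; the 32×32 RGB input and 10 output classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Convolution + activation, pooling, flattening, then a fully connected
# classifier -- the pipeline described in this section.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # (B, 16, 32, 32)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # (B, 16, 16, 16)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # (B, 32, 16, 16)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # (B, 32, 8, 8)
    nn.Flatten(),                                 # (B, 32*8*8) = (B, 2048)
    nn.Linear(32 * 8 * 8, 10),                    # fully connected layer
)

x = torch.randn(4, 3, 32, 32)
print(model(x).shape)    # torch.Size([4, 10])
```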