CNNs
CNNs introduce hierarchy and locality into neural networks.
1. Universal Approximator
Let $\sigma$ be a non-constant, bounded, continuous activation function, and let $f$ be any continuous function on a compact set $K \subset \mathbb{R}^d$.
If we allow enough hidden units, then for any $\epsilon > 0$ there exist weights such that the one-hidden-layer network $g(x) = \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^\top x + b_i)$ satisfies $|f(x) - g(x)| < \epsilon$ for all $x \in K$.
1.1 Curse of Dimensionality
As the number of features or dimensions grows, the amount of data we need to generalise accurately grows exponentially.
To approximate a Lipschitz (continuous) function $f : [0,1]^d \to \mathbb{R}$ to accuracy $\epsilon$ requires $\mathcal{O}(\epsilon^{-d})$ training samples.
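As a rough numeric illustration (a sketch that reads the bound literally as $\epsilon^{-d}$ and ignores constants):

```python
# Rough illustration of the O(eps^{-d}) sample-complexity bound,
# taking the bound literally as eps**(-d) and ignoring constants.
eps = 0.1  # target accuracy

for d in (1, 2, 5, 10, 20):
    n = eps ** (-d)  # samples needed to cover [0,1]^d at resolution eps
    print(f"d={d:2d}: ~{n:.0e} samples")
# d=20 already needs ~1e20 samples -- the curse of dimensionality.
```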
In the $d$-dimensional "orange" example, take a ball of radius $r$ and shrink its radius by $\delta$; the rind is the shell left behind, with volume fraction $1 - \left(\frac{r - \delta}{r}\right)^d$.
- Beginning with no shrinking ($\delta = 0$), and a radius that is decreasing ($\delta$ growing), we see the initial rate of growth of the rind volume is $\frac{d}{r}$.
- Initially, the volume of the rind grows $d$ times faster than the rate at which the object is being shrunk.
- In higher dimensions, tiny changes in distance translate to large changes in volume.
In higher dimensions, most Euclidean distances between observations in a dataset are nearly the same and close to the diameter of the region in which they are enclosed.
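A minimal simulation sketch of this concentration effect, using uniformly sampled points and illustrative sample sizes:

```python
import numpy as np

# Sketch: sample points uniformly in [0,1]^d and check how pairwise
# Euclidean distances behave as d grows -- in high dimensions the
# relative spread (std/mean) shrinks, i.e. distances concentrate.
rng = np.random.default_rng(0)
n = 500

for d in (2, 10, 100, 1000):
    x = rng.random((n, d))                        # n points in [0,1]^d
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T  # squared distances (Gram trick)
    dists = np.sqrt(np.clip(d2, 0, None))[np.triu_indices(n, k=1)]
    print(f"d={d:4d}: mean={dists.mean():6.2f}, "
          f"std/mean={dists.std() / dists.mean():.3f}")
```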
1.2 Invariance & Equivariance
Shift invariance describes a system's unchanging response to input shifts. For example, in image recognition, a shift-invariant system would recognize an object regardless of its position in the image. Assuming a function $f$ and a shift operator $\mathcal{T}_s$, shift invariance means $f(\mathcal{T}_s x) = f(x)$ for every shift $s$.
Equivariance means that if the input changes in a certain way, the output changes in the same way. For example, in image processing, if an image is rotated, an equivariant system would produce an output that is also rotated in the same manner. A function $f$ is shift equivariant if $f(\mathcal{T}_s x) = \mathcal{T}_s f(x)$ for every shift operator $\mathcal{T}_s$.
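A minimal 1D sketch of the distinction, using a circular cross-correlation so that shifts wrap around (the signal and kernel values are illustrative):

```python
import numpy as np

def shift(x, s):
    # Circular shift by s positions.
    return np.roll(x, s)

def circ_corr(x, w):
    # output[i] = sum_a w[a] * x[i + a], with indices taken mod len(x)
    return np.array([sum(w[a] * x[(i + a) % len(x)] for a in range(len(w)))
                     for i in range(len(x))])

x = np.array([0., 1., 3., 2., 0., 0., 1., 0.])
w = np.array([1., -1.])          # small edge-detecting kernel

# Equivariance: correlating a shifted input equals shifting the correlation.
lhs = circ_corr(shift(x, 2), w)
rhs = shift(circ_corr(x, w), 2)
print(np.allclose(lhs, rhs))     # True -> shift equivariant

# Invariance: a global max over the feature map ignores the shift entirely.
print(circ_corr(x, w).max() == circ_corr(shift(x, 2), w).max())  # True
```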
1.3 Inductive Bias
We can introduce two principles:
- Translation Invariance: a shift in the input should lead to a shift in the hidden representation.
- Locality: we should not have to look far away from a location to glean relevant information about that area.
These can be applied with a sliding window approach, using the correlation
$$h_{i,j} = \sum_{a}\sum_{b} w_{a,b}\, x_{i+a,\, j+b},$$
where $x$ is the image patch and $w$ is the kernel/filter.
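A minimal NumPy sketch of this sliding-window correlation (valid positions only, stride 1, single channel; names are illustrative):

```python
import numpy as np

def corr2d(image, kernel):
    # Slide the kernel over every valid position of the image and take
    # the elementwise product-and-sum at each location.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # local window at (i, j)
            out[i, j] = (patch * kernel).sum()  # h_ij = sum_ab w_ab x_{i+a,j+b}
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, -1.0]])                # detects horizontal edges
print(corr2d(image, kernel))                    # shape (4, 3)
```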
1.4 Deformation Invariance
We may also want invariance to deformation. Here we would have a warp operator $\mathcal{D}_\tau$ that distorts the input by a smooth displacement field $\tau$, and we would like $f(\mathcal{D}_\tau x) \approx f(x)$ when the deformation is small.
2. Convolutions
- Convolution: $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Correlation: $(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t + \tau)\, d\tau$ for $t \in \mathbb{R}$.
- Discrete Convolution: given arrays $a$ and $b$, their convolution is $(a * b)_k = \sum_i a_i\, b_{k-i}$. When $a_i$ or $b_{k-i}$ are undefined they are assumed to be zero.
A convolution is commutative, associative, distributive over addition, and associative with scalar multiplication.
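A small sketch of the discrete case, with a check of commutativity and agreement with `np.convolve`:

```python
import numpy as np

def conv(a, b):
    # Discrete convolution: (a * b)_k = sum_i a_i b_{k-i}, treating
    # out-of-range entries as zero. Result length is len(a) + len(b) - 1.
    n = len(a) + len(b) - 1
    out = np.zeros(n)
    for k in range(n):
        for i in range(len(a)):
            if 0 <= k - i < len(b):
                out[k] += a[i] * b[k - i]
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 0.5])

print(conv(a, b))
print(np.allclose(conv(a, b), conv(b, a)))         # commutative
print(np.allclose(conv(a, b), np.convolve(a, b)))  # agrees with NumPy
```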
3. CNNs
In conventional NNs, the input tensor is flattened into a vector, but this discards the locality and translation-invariance principles.
3.1 Convolutional Layer
In a CNN, we keep locality as a built-in constraint: the input stays a 3D tensor (height × width × channels), and each output value depends only on a small local patch of the input, computed by sliding a kernel of shared weights across it.
Each kernel produces its own activation map, and these are stacked to produce the output volume. If there are $K$ kernels, the output volume has $K$ channels.
A CNN is a sequence of convolutional layers interleaved with activation functions.
Each filter has $k_h \times k_w \times C_{in}$ weights plus a bias, where $C_{in}$ is the number of input channels.
A layer with $K$ such filters therefore has $K\,(k_h\, k_w\, C_{in} + 1)$ parameters.
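A minimal PyTorch sketch of a single convolutional layer (the sizes are illustrative assumptions), showing the stacked output volume and the $K(k_h k_w C_{in} + 1)$ parameter count:

```python
import torch
import torch.nn as nn

# Sketch: one convolutional layer with C_in = 3 input channels and K = 16
# kernels of size 5x5. Each kernel produces one activation map; the maps
# are stacked into an output volume with 16 channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.randn(1, 3, 32, 32)          # batch of one 32x32 RGB image
y = conv(x)
print(y.shape)                          # torch.Size([1, 16, 28, 28])

# Parameter count: K * (k_h * k_w * C_in + 1) = 16 * (5*5*3 + 1) = 1216
print(sum(p.numel() for p in conv.parameters()))   # 1216
```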
3.1.1 Stride & Padding
In a convolution:
- Stride is the number of pixels by which we slide the filter across the input image.
- Padding is the number of pixels added to the border of the input image. If you need to keep the height and width dimensions then you can zero-pad the image.
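For reference (the standard output-size relation, with symbols chosen here for illustration): with input size $n$, kernel size $k$, padding $p$ and stride $s$, each spatial dimension of the output is
$$o = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1,$$
so stride 1 with $p = \lfloor k/2 \rfloor$ (odd $k$) preserves the height and width.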
3.1.2 Computational Complexity
As each convolution decreases the spatial dimensions of its output, stacked layers see ever larger regions of the original input (their receptive field grows), while the cost of a single layer scales with its kernel area. This can be exploited by using multiple small filters to approximate a single large filter.
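A hedged comparison sketch (the channel count is an illustrative assumption): two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer parameters and an extra nonlinearity.

```python
import torch.nn as nn

# Compare one 5x5 convolution with two stacked 3x3 convolutions.
# Both see a 5x5 receptive field, but the stacked version has fewer
# weights (2 * 3*3*C*C vs 5*5*C*C, ignoring biases).
C = 64

big = nn.Conv2d(C, C, kernel_size=5, padding=2)
small = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(big))    # 5*5*64*64 + 64      = 102464
print(count(small))  # 2*(3*3*64*64 + 64)  =  73856
```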
3.2 Pooling Layer
Pooling aggregates and downsamples the input tensor. The order of pixels within a patch does not matter, only their values. This introduces translational invariance. Some pooling methods are:
- Max Pooling: takes the maximum value in each patch. Max pooling can break shift equivariance. This can be partially solved by anti-aliasing (blurring) before downsampling.
- Average Pooling: takes the average value in each patch.
- etc.
This is applied to each channel separately.
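A minimal PyTorch sketch of 2×2 max pooling with stride 2 (the input values are illustrative):

```python
import torch
import torch.nn as nn

# 2x2 max pooling with stride 2, applied to each channel separately.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.tensor([[[[1., 2., 0., 1.],
                    [3., 4., 1., 0.],
                    [0., 0., 2., 2.],
                    [1., 1., 2., 5.]]]])   # shape (1, 1, 4, 4)

print(pool(x))
# tensor([[[[4., 1.],
#           [1., 5.]]]])  -- each 2x2 patch is replaced by its maximum
```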
3.3 Fully Connected Layer
A standard dense layer, where each neuron is connected to every neuron in the previous layer.
3.4 Flatten Layer
These connect the convolutional layers to the fully connected layers by flattening the 3D tensor into a 1D vector.
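A minimal end-to-end sketch tying these layers together (PyTorch; the 32×32 RGB input and 10 output classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Convolution + activation, pooling, flattening, then a fully connected
# classifier -- the pipeline described in this section.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # (B, 16, 32, 32)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # (B, 16, 16, 16)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # (B, 32, 16, 16)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # (B, 32, 8, 8)
    nn.Flatten(),                                 # (B, 32*8*8) = (B, 2048)
    nn.Linear(32 * 8 * 8, 10),                    # fully connected layer
)

x = torch.randn(4, 3, 32, 32)
print(model(x).shape)    # torch.Size([4, 10])
```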