Introduction to Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are among the most successful and widely used neural network algorithms. The three main components of a CNN are the Convolutional Layer, the Pooling Layer (used to reduce the computational space), and the Fully Connected Layer.
Imagine we have to classify 32x32 images. A single fully-connected neuron in a first hidden layer would have 32x32x3 = 3072 weights, and this structure cannot scale to larger images. Clearly, this full connectivity is wasteful, and it quickly leads us to overfitting.
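As a quick sanity check of this arithmetic, here is a minimal Python sketch (the 200x200 case is an illustrative larger image, not taken from the text):

```python
# Weights of a single fully-connected neuron on an RGB image:
# one weight per input value, i.e. width * height * channels.
def fc_weights_per_neuron(width, height, channels=3):
    return width * height * channels

print(fc_weights_per_neuron(32, 32))    # 3072 weights for a 32x32 RGB image
print(fc_weights_per_neuron(200, 200))  # 120000 weights: this does not scale
```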

A CNN, like a regular neural network, is made up of neurons with learnable weights and biases. Its key aspect is that the layers have neurons arranged in three dimensions: width, height, and depth. An image is therefore treated as an input volume of activations whose depth is 3, one slice per color channel. To process an image, a typical approach is to convolve it with a filter, also called a kernel, in order to keep only the salient features (e.g. the edges) of the original image.

From the figure above, in order to obtain the convolved feature value 4, we have to compute the linear combination of the original pixels in the input data and the corresponding weights of the kernel (a minimal code sketch follows the definitions below). In CNNs it is necessary to introduce a bit of terminology:
The Feature Map is each depth slice of the convolutional layer (the 2D grid of activations produced by one filter).
The Receptive Field is the local 3D patch in the input volume to which a specific neuron is connected.
The Depth Column is the set of neurons that are all looking at the same region of the input.
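The following is a minimal NumPy sketch of the convolution step described above; like most deep learning libraries, it actually computes a cross-correlation (the kernel is not flipped), and the toy image and filter values are made up for illustration:

```python
import numpy as np

def convolve2d(image, kernel):
    """Each output value is the linear combination of an image patch
    (the receptive field) and the kernel weights."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 single-channel image
kernel = np.array([[1., 0.], [0., -1.]])          # toy 2x2 filter
print(convolve2d(image, kernel))                  # 3x3 map of convolved features
```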

Each neuron is connected only to its local receptive field, and all neurons in a given output feature map share the same weights (i.e. they apply the same filter). The reason why we need more than one output feature map is that each filter can extract different features from the input image (e.g. horizontal edges or vertical edges), as the sketch below illustrates.
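As a small illustration, the sketch below applies two different hand-crafted filters to the same toy image and obtains two different feature maps; the Prewitt-style filter values are an assumption for illustration, and SciPy's correlate2d performs the same sliding linear combination as above:

```python
import numpy as np
from scipy.signal import correlate2d

# A toy image: dark on the left, bright on the right (one vertical edge).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Two different filters produce two different feature maps.
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])
horizontal_edge = vertical_edge.T

print(correlate2d(image, vertical_edge, mode='valid'))    # responds at the edge
print(correlate2d(image, horizontal_edge, mode='valid'))  # all zeros: no horizontal edges
```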
In a CNN it is common to insert a Pooling Layer in-between successive Convolutional Layers. It reduces the spatial size of the representation, and hence the amount of computation and parameters in later layers, keeping only the useful information. For example, a pooling layer can divide each feature map into four areas and report only the maximum value of each area (max pooling).
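A minimal NumPy sketch of 2x2 max pooling, matching the four-area example above (the input values are illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep only the maximum of each size x size area."""
    h, w = feature_map.shape
    return feature_map.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 1.],
              [5., 1., 9., 2.],
              [0., 2., 3., 4.]])
print(max_pool(x))  # [[4. 2.]
                    #  [5. 9.]]
```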

The last fundamental layer in a CNN is the Fully Connected Layer. Neurons in a fully connected layer have full connections to all activations in the previous layer, as in a regular ANN, and their activations can be computed via a matrix multiplication followed by a bias offset. The most effective activation function in this case is the rectifier function, better known as the ReLU function, which stands for Rectified Linear Unit.
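A minimal NumPy sketch of a fully connected layer followed by ReLU (the shapes and random values are illustrative assumptions):

```python
import numpy as np

def fully_connected(x, W, b):
    """Matrix multiplication followed by a bias offset."""
    return W @ x + b

def relu(z):
    """ReLU (Rectified Linear Unit): max(0, z) element-wise."""
    return np.maximum(0., z)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)       # e.g. a flattened feature vector
W = rng.standard_normal((4, 8))  # 4 neurons, each connected to all 8 activations
b = rng.standard_normal(4)
print(relu(fully_connected(x, W, b)))
```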

In other words, the most common CNN architecture follows this pattern:
INPUT -> [[CONV -> RELU]xN -> POOL]xM -> [FC -> RELU]xK -> FC
where:
N >= 0, usually N <= 3
M >= 0
K >= 0, usually K <= 2

From the schema above: after the input we have a series of N convolutional layers, each using the ReLU activation function; a pooling layer then reduces the spatial dimensions, and this whole block is repeated M times. The result is flattened into a single vector, after which we can add K fully connected layers (with ReLU) before the final fully connected output layer.
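As a concrete instance of this pattern, here is a sketch using PyTorch (an assumption; the text does not prescribe a framework), with N=1, M=2, K=1 for a 32x32 RGB input:

```python
import torch
import torch.nn as nn

# INPUT -> [[CONV -> RELU]x1 -> POOL]x2 -> [FC -> RELU]x1 -> FC
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32 -> 16x32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x32x32 -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x16x16 -> 32x16x16
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x16x16 -> 32x8x8
    nn.Flatten(),                                 # 32x8x8 -> single vector of 2048
    nn.Linear(32 * 8 * 8, 128),                   # FC -> RELU
    nn.ReLU(),
    nn.Linear(128, 10),                           # final FC, e.g. 10 classes
)

out = model(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```

Note that this sketch already uses the common filter sizes mentioned next: 3x3 convolutions and 2x2 pooling.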

The most common initial approach when starting a CNN architecture is the following:
The most common Convolutional Layer filter sizes are 3x3 and 5x5.
The most common Pooling Layer filter size is 2x2.

Despite the success of CNNs, they are often referred to as black-box algorithms, in which we cannot really tell what is going on under the hood. To face this problem, several approaches for understanding and visualizing CNNs have been developed in the literature.