Convolutional Neural Networks, usually called ConvNets or CNNs, are one of the most commonly used neural network architectures. CNNs are generally used for image-based data: image recognition, image classification and object detection are some of the areas where they are widely used.

The branch of applied AI that deals specifically with image data is termed Computer Vision. Computer Vision has grown monumentally since the introduction of CNNs. The first part of a CNN extracts features from images using convolution, followed by an activation function that introduces non-linearity.

The last block feeds these features into a neural network to solve a specific problem. For example, a classification network will have n output neurons, where n is the number of classes. Let us try to understand the architecture and working of a CNN.


**Convolution**

Convolution is an image processing technique in which a weighted kernel (a square matrix) slides over the image; at each position the kernel elements are multiplied with the image pixels beneath them and the products are added up. This method can be easily visualised by the image shown below.

Image by: Peltarion

Convolution filter and output

As we can see, when we use a 3×3 convolution kernel, a 3×3 part of the image is operated on and, after multiplication and subsequent addition, one value comes out as the output. So on a 4×4 image we get a 2×2 convolved output, given that the kernel size is 3×3 (with a stride of 1 and no padding).
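The 4×4 image and 3×3 kernel case can be sketched in a few lines of NumPy. This is only an illustration: the function name `convolve2d_valid`, the image values and the all-ones kernel are assumptions made for the example, and, as deep learning libraries commonly do, the kernel is applied without flipping it.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + k_h, col:col + k_w]
            # Multiply element-wise, then add up: one output value per patch.
            output[row, col] = np.sum(patch * kernel)
    return output

image = np.arange(16, dtype=float).reshape(4, 4)  # illustrative 4x4 "image"
kernel = np.ones((3, 3))                          # illustrative 3x3 kernel
feature_map = convolve2d_valid(image, kernel)

print(feature_map.shape)     # (2, 2)
print(feature_map.tolist())  # [[45.0, 54.0], [81.0, 90.0]]
```

In a real CNN the kernel values would not be fixed ones; they are the weights that training adjusts.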

The size of the convolved output depends on the size of the kernel used. Convolution is the typical starting layer of a CNN. The convolved output represents the features found in the image, and which features are found is directly related to the kernel size being used.

If the images are such that even small differences will make an image fall into a different output category, then a small kernel size is used for feature extraction. Otherwise a bigger kernel can be used. The values in the kernel are often termed convolutional weights. They are initialised and then updated by gradient descent, using the gradients computed by backpropagation.


**Pooling**

The pooling layer is placed between convolution layers. It performs pooling operations on the feature maps sent by a convolution layer. A pooling operation reduces the spatial size of the features, which is a form of dimensionality reduction.

One of the major reasons for pooling is to decrease the computational power required to process the data. Although a pooling layer reduces the size of the feature maps, it preserves their important characteristics. It works much like a convolution filter: a window moves over the features and aggregates the values it covers.

Various aggregation functions can be used. Average and max pooling are the most commonly used pooling operations. Pooling reduces the dimensions of the features but keeps the characteristics intact.
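A minimal NumPy sketch of both aggregation functions on the same feature map follows. The helper name `pool2d`, its parameters and the sample values are assumptions made for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Aggregate each size x size window with max or average."""
    reducers = {"max": np.max, "average": np.mean}
    reducer = reducers[mode]
    height, width = feature_map.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            r0, c0 = row * stride, col * stride
            window = feature_map[r0:r0 + size, c0:c0 + size]
            pooled[row, col] = reducer(window)
    return pooled

feature_map = np.array([[1., 3., 2., 4.],
                        [5., 6., 1., 2.],
                        [7., 2., 9., 1.],
                        [3., 4., 2., 8.]])

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled.tolist())  # [[6.0, 4.0], [7.0, 9.0]]
print(avg_pooled.tolist())  # [[3.75, 2.25], [4.0, 5.0]]
```

Both outputs are 2×2: the spatial size is halved in each direction, while max pooling keeps only the strongest activation of each window and average pooling blends all four.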

Because the feature maps become smaller, the layers that follow need fewer parameters and fewer calculations. This reduces overfitting and increases the efficiency of the network. Max pooling is the most widely used: the position of a maximum value is known less precisely in the pooled map than in the map coming from the convolution, which makes the network less sensitive to small shifts in the image.

This is good for many cases. For example, to recognise a dog, its ears do not need to be located as precisely as possible; knowing that they are located roughly next to the head is enough.

Max pooling also acts as a noise suppressant: it discards noisy activations altogether, so it performs de-noising along with dimensionality reduction. Average pooling, on the other hand, only smooths the noise by averaging while it reduces dimensionality. For this reason max pooling is often said to perform better than average pooling, although the better choice depends on the task.

**Activation Function**

ReLU (Rectified Linear Units) is the most commonly used activation function layer.

Equation for the same is: ReLU(x)=max(0,x)

And graphical representation is given below:

Source: Medium

ReLU representation

ReLU maps negative values to zero and keeps positive values as they are.
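As a quick illustration, ReLU is a one-liner in NumPy. The sample activations below are made up for the example.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

activations = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])  # illustrative values
rectified = relu(activations)
print(rectified.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```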

**Fully Connected Layer**

A fully connected layer is usually the last layer of a classification network. It receives an input vector and produces a new output vector. The output layer has n neurons, where n is the number of classes in the image classification problem. Each element of the output vector gives the probability of the image belonging to a certain class. Hence the sum of all the elements of the output vector is always 1.

The calculations happening in the output layer are as follows:

- Multiply each input element by the corresponding weight of the neuron and add up the results
- Apply an activation function on the layer (logistic, i.e. sigmoid, when n=2; softmax when n>2)

The output will now be the probability of the image belonging to a certain class. The weights of the layer are learnt during training by backpropagation of the gradient.
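The two steps can be sketched for a three-class problem as below. This is a minimal sketch, not a trained model: the random weights, the feature vector length of 8, the bias term and the function names are all assumptions, and softmax is used because it is the usual activation when there are more than two classes.

```python
import numpy as np

def fully_connected(features, weights, bias):
    """Weighted sum of the inputs for every output neuron."""
    return weights @ features + bias

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

rng = np.random.default_rng(seed=0)
n_classes = 3
features = rng.normal(size=8)               # illustrative flattened feature vector
weights = rng.normal(size=(n_classes, 8))   # untrained, random weights
bias = np.zeros(n_classes)

probabilities = softmax(fully_connected(features, weights, bias))
print(probabilities)        # one probability per class
print(probabilities.sum())  # sums to 1 (up to floating-point rounding)
```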


**Dropout Layer**

Dropout layers work as regularisation layers that reduce overfitting and improve generalisation. Overfitting is a major concern while using a neural network. Dropout, as the name suggests, drops out some percentage of the neurons in the layer after which it is placed.

Dropout regularises by approximating the training of a large number of neural networks with different architectures in parallel. During training, some of the layer outputs are randomly dropped or ignored. This makes the layer look like a layer with a different number of nodes, since some neurons are turned off, and its connectivity to the previous layer changes accordingly.
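A minimal sketch of the idea in NumPy follows. It assumes the common "inverted dropout" variant, which rescales the surviving outputs by 1/(1 - rate) during training so that nothing needs to change at inference time; that rescaling, the function name and the parameters are assumptions not described in the text above.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return activations  # dropout is switched off at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = neuron kept
    # Assumed "inverted dropout": rescale so the expected output stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(seed=42)
layer_output = np.ones(10)  # illustrative layer output
dropped = dropout(layer_output, rate=0.5, rng=rng)
print(dropped)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
```

Each training step draws a fresh mask, so the network effectively sees a different thinned architecture every time.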

**Hyperparameters**

Certain parameters can be controlled according to the image data being dealt with. Each layer of a CNN can be parameterised, be it a convolution layer or a pooling layer. These parameters affect the size of the feature map that the layer outputs.

Each image (input) or feature map (the output of a subsequent layer) has dimensions W x H x D, where W x H is width x height, i.e. the size of the map or image, and D is the depth, i.e. the number of channels. Monochrome images have D=1 and RGB (colour) images have D=3.

**Convolution Layer hyperparameters**

- Number of filters (K)
- Size of the filter (F) of the dimension FxFxD
- Strides: Number of steps taken for the kernel to shift over the image. S=1 means that the kernel will move with 1 pixel as the step.
- Zero padding (P): a border of P zeros is added around the input. This matters most for small images, because convolution and max pool layers reduce the size of the feature map at every layer.

Source: XRDS

Zero padding increased the size of the input image
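Zero padding can be illustrated with NumPy's `np.pad`. The 4×4 input and the padding width of 1 are example values chosen here.

```python
import numpy as np

image = np.arange(1, 17).reshape(4, 4)  # illustrative 4x4 input
# P=1: add a one-pixel border of zeros on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape)   # (4, 4)
print(padded.shape)  # (6, 6)
print(padded)
```

With P=1 a 3×3 kernel at stride 1 produces an output of the same 4×4 size as the original input, instead of shrinking it to 2×2.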

For each input image of size W×H×D, the convolution layer returns a matrix of dimensions Wc×Hc×Dc, where

Wc= (W-F+2P)/S+1

Hc= (H-F+2P)/S+1

Dc= K

Choosing Stride (S)=1 and requiring the output to keep the input size (Wc=W, Hc=H), the equations give Padding (P)=(F-1)/2.

In general, we then choose F=3,P=1,S=1 or F=5,P=2,S=1
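The output-size formulas are easy to check numerically. In the sketch below the function names and the 32×32 input with 16 filters are illustrative assumptions; integer division is used, which assumes (W-F+2P) is divisible by S. Setting Wc=W with S=1 in the formula gives P=(F-1)/2, which `same_padding` computes for odd F.

```python
def conv_output_shape(width, height, kernel_size, padding, stride, num_filters):
    """Wc = (W - F + 2P) / S + 1, Hc = (H - F + 2P) / S + 1, Dc = K."""
    out_w = (width - kernel_size + 2 * padding) // stride + 1
    out_h = (height - kernel_size + 2 * padding) // stride + 1
    return out_w, out_h, num_filters

def same_padding(kernel_size):
    """Padding that keeps W and H unchanged when S=1 (odd kernel sizes)."""
    return (kernel_size - 1) // 2

print(same_padding(3), same_padding(5))  # 1 2
print(conv_output_shape(32, 32, kernel_size=3, padding=1, stride=1, num_filters=16))  # (32, 32, 16)
print(conv_output_shape(32, 32, kernel_size=5, padding=2, stride=1, num_filters=16))  # (32, 32, 16)
print(conv_output_shape(4, 4, kernel_size=3, padding=0, stride=1, num_filters=1))     # (2, 2, 1)
```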

**Pooling Layer hyperparameters**

- Cell size (F): The square cell size in which the map will be divided for pooling. FxF
- Step size (S): Cells are separated by S pixels

For each input image of size W×H×D, the pooling layer returns a matrix of dimensions Wp×Hp×Dp, where

Wp= (W-F)/S+1

Hp= (H-F)/S+1

Dp= D

For the pooling layer, F=2 and S=2 is widely chosen; it eliminates 75% of the input pixels. One can also choose F=3 and S=2. A larger cell size results in a large loss of information and is hence suitable only for very large input images.
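The 75% figure follows directly from the formulas, as this small check shows. The function name and the 32×32×16 input are illustrative assumptions, and integer division assumes (W-F) is divisible by S.

```python
def pool_output_shape(width, height, depth, cell_size, stride):
    """Wp = (W - F) / S + 1, Hp = (H - F) / S + 1, Dp = D."""
    out_w = (width - cell_size) // stride + 1
    out_h = (height - cell_size) // stride + 1
    return out_w, out_h, depth

in_shape = (32, 32, 16)  # illustrative W x H x D
out_shape = pool_output_shape(*in_shape, cell_size=2, stride=2)

kept = (out_shape[0] * out_shape[1]) / (in_shape[0] * in_shape[1])
print(out_shape)  # (16, 16, 16)
print(1 - kept)   # 0.75
```

Width and height are both halved, so only a quarter of the values per channel survive, while the depth is unchanged.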

**General hyperparameters**

- Learning rate: Controls the step size of each weight update. Optimizers like SGD, AdaGrad or RMSProp apply it; AdaGrad and RMSProp additionally adapt it per parameter during training.
- Epochs: The number of epochs should be increased until a gap between training and validation error shows up.
- Batch size: Values from 16 to 128 can be selected, depending on the amount of processing power that one has.
- Activation Function: Introduces non-linearity to the model. ReLU is typically used for ConvNets. Other options are sigmoid and tanh.
- Dropout: A dropout value of 0.1 drops 10% of the neurons. 0.5 is a good starting point, and 0.25 is often a good final value.
- Weight Initialisation: Weights should be initialised to small random values to reduce the possibility of dead neurons, but not so small that the gradients become too small for gradient descent to make progress. A uniform distribution is suitable.
- Hidden layers: Hidden layers can be added as long as the test error keeps decreasing. More hidden layers increase computation and require regularisation.
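As an illustration of the weight initialisation point, the sketch below draws small random weights from a uniform distribution. The range used is the Glorot (Xavier) uniform limit, which is one common choice and an assumption of this example rather than something prescribed above; the function name and layer sizes are also illustrative.

```python
import numpy as np

def uniform_init(fan_in, fan_out, rng):
    """Small uniform weights; the limit is the Glorot/Xavier uniform bound (assumed choice)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(seed=0)
weights = uniform_init(fan_in=128, fan_out=10, rng=rng)

print(weights.shape)                                     # (10, 128)
print(bool(np.abs(weights).max() <= np.sqrt(6.0 / 138)))  # True
```

The limit shrinks as the layer gets wider, which keeps the weights small without collapsing them to values so tiny that gradients vanish.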

**Conclusion**

We now have the basic information needed to create a CNN from scratch. Although this article covers everything at a basic level, each parameter or layer can be explored in more depth. Understanding the maths behind every concept also helps in building better models.
