A quick Google search of “data science” will reveal to anyone how popular the
field has become in the last five years. Along with data science, artificial
intelligence, machine learning, and deep learning have become equally prominent in
computer science. The latest addition to this list is the convolutional neural
network, an innovation from the field of computer vision.
Where Did It All Start?
Neural networks had their breakthrough moment in 2012, when Alex Krizhevsky won the
ImageNet competition that year. This competition is akin to the Olympics of computer
vision, and when Krizhevsky used a convolutional neural network, the classification
error dropped from 26% to 15%. This was the ray of hope that companies and computer
scientists needed. Since then, companies like Instagram, Facebook, and Pinterest
have enthusiastically implemented neural networks to provide the best experience to
their audience.
The biological roots of convolutional neural networks also help make their
foundation clear. In 1962, Hubel and Wiesel showed that different neurons in the
visual cortex fired only when specific visual cues were present. These neurons
were arranged in a columnar structure and, when fired together, produced visual perception.
For example, some neurons only fired when they were exposed to horizontal edges.
Others fired in the presence of vertical or diagonal edges. Thus, different neurons
responded to different visual components and enabled us to see.
What is a Convolutional Neural Network?
A convolutional neural network, also called a CNN or ConvNet, is a deep learning
algorithm. It takes an input image, assigns learnable weights and biases to the
components of the image, and then classifies the entire image. With enough training,
ConvNets are capable of learning these filters themselves, and the pre-processing
required is lower compared to other classification algorithms.
What we ultimately want a convolutional neural network to do is differentiate
between images and classify them correctly. It is able to capture the spatial
dependencies in an image through the application of relevant filters.
The Basics of How it Works
The image becomes an array whose size depends on the resolution of the image.
Each entry in the array is a number from 0 to 255 (if the RGB system is
used), representing the pixel intensity at that point.
Taking all these numbers as input, the computer outputs a number for each class. Each
number signifies the probability of the image belonging to that class (for example,
house, road, bus, dog, or cat).
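The pixel representation above can be sketched in a few lines of NumPy. The image here is synthetic (random values); a real one would be loaded with a library such as Pillow and converted with numpy.asarray, but the array layout is the same.

```python
# Minimal sketch: an RGB image as an array of pixel intensities (0-255).
import numpy as np

height, width = 4, 4
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

print(image.shape)   # (4, 4, 3): height x width x RGB channels
print(image[0, 0])   # intensity of the top-left pixel in each channel
print(image.min() >= 0 and image.max() <= 255)  # True: values stay in 0-255
```

Every pixel is three numbers (one per channel), and the whole image is just this grid of numbers that the network takes as input.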
Structure of a CNN
Seeing the above image, you might think there are a lot of layers in a convolutional
neural network, but in reality there are only three major ones:
1. The convolutional layer
2. The pooling layer
3. The fully connected layer
Let’s dive deeper into each one of these.
The convolutional layer
This is the core layer of the convolutional neural network. Its parameters are
composed of a set of learnable filters. These filters are small spatially, but they
extend through the full depth of the input volume.
The main task performed at the convolutional layers is feature extraction. The first
convolutional layer (as shown in the image above) is responsible for extracting low-
level features like colors and edges. The subsequent convolutional layers extract
higher-level features, thus leading to a complete understanding of the image.
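The convolution operation itself can be sketched in plain NumPy (no padding, stride 1). The 3x3 filter below is a hand-written vertical-edge detector; in a real CNN the filter values are learned rather than fixed, but the sliding multiply-and-sum is the same.

```python
# Minimal sketch of 2-D convolution: slide the kernel over the image,
# multiply element-wise, and sum to produce one output value per position.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Grayscale image: dark left half, bright right half (a vertical edge).
image = np.zeros((5, 5))
image[:, 3:] = 255.0

vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # strong responses only where the edge lies
```

The output feature map is large exactly where the brightness changes from left to right, which is what "this filter detects vertical edges" means in practice.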
The Pooling Layer
This layer reduces the spatial size of the image representation. As such, it
also reduces the computation and memory required by the neural network.
Additionally, it extracts dominant features that are positionally and rotationally
invariant.
One type of pooling is done using the Max operation, which picks the
maximum value from each neuron cluster in the prior layer. The other type is
Average pooling, which returns the average value of the cluster.
Since Max pooling also acts as a noise suppressant, it generally performs better than
Average pooling.
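Both pooling variants can be sketched with a small NumPy helper. This is a 2x2 window with stride 2, the most common configuration, applied to a toy 4x4 feature map.

```python
# Minimal sketch of 2x2 max and average pooling with stride 2.
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1.0, 3.0, 2.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [9.0, 2.0, 1.0, 0.0],
               [3.0, 4.0, 5.0, 6.0]])

print(pool2d(fm, mode="max"))  # [[6. 8.] [9. 6.]]
print(pool2d(fm, mode="avg"))  # [[3.75 5.25] [4.5  3.  ]]
```

Note how the 4x4 map shrinks to 2x2 either way: max keeps only the strongest activation in each window, while average smooths all four values together.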
As is depicted in the image above, there are multiple pooling layers in addition to
convolutional layers. The greater the number of these layers, the more complex the
features that can be extracted. However, the computational power expended also increases.
Once the image has passed through all the convolutional and pooling
layers, feature extraction is complete. It is now time for the classification of the image, which the fully connected layer carries out.
The Fully Connected Layers (FCL)
As the last stage, the FC layers form a simple feed-forward neural network. The input to
the first fully connected layer is the flattened output of the last pooling or convolutional
layer. To flatten means that the 3-dimensional array is unrolled into a vector.
Each fully connected layer applies a learned affine transformation followed by an activation function. After the vector has passed through all the fully connected layers, the softmax activation function is applied in the final layer. This is used to compute the probability of the input belonging to each class.
Thus, the end result is the different probabilities of the input image belonging to different classes.
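The flatten-and-softmax step can be sketched as follows. The pooled feature maps and class scores here are made-up placeholder values, standing in for what a trained network would actually produce.

```python
# Minimal sketch: flatten the last pooled feature maps into a vector,
# then apply softmax to turn the final layer's class scores into probabilities.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

# Pretend output of the last pooling layer: 2 feature maps of size 2x2.
pooled = np.arange(8.0).reshape(2, 2, 2)
flat = pooled.reshape(-1)  # unroll the 3-D array into a vector
print(flat.shape)          # (8,)

# Hypothetical class scores produced by the final fully connected layer.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)                           # largest score gets the largest probability
print(np.isclose(probs.sum(), 1.0))    # True: a valid distribution over classes
```

Whichever class receives the highest probability is the network's prediction for the input image.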
The process is repeated for different types of images and individual images within those types. This trains the network and teaches it to differentiate between a dog and a cat, and a rose and a sunflower.
The underlying technology of convolutional neural networks is being continuously refined, and the networks are heavily trained so as to output accurate probabilities. It can rightly be said that, in the field of computer vision, CNNs have sparked a revolution.