Beginner’s Guide for Convolutional Neural Network (CNN)
Updated on Aug 20, 2025 | 11 min read | 6.33K+ views
For decades, teaching a computer to see was one of the biggest challenges in artificial intelligence. How can a machine learn to recognize a cat in a photo, no matter its color, size, or angle? The breakthrough came with a specialized architecture designed specifically for visual data.
This powerful architecture is the Convolutional Neural Network (CNN), also known as ConvNet. It's a type of deep learning model that mimics the human visual cortex, allowing it to process and interpret images with incredible accuracy. As the backbone of modern computer vision, the Convolutional Neural Network in deep learning is used for everything from medical imaging to social media photo tagging. Let's explore how it all works!
Ready to take your deep learning skills to the next level? Enroll in upGrad's Artificial Intelligence & Machine Learning - AI ML Courses to gain hands-on experience in NLP, deep learning, neural networks, and more. Get job-ready today.
CNNs are inspired by the visual cortex – the area of the brain responsible for processing visual information. The visual cortex contains many small cellular regions that are sensitive to specific visual stimuli.
This idea was expanded in 1962 by Hubel and Wiesel, whose experiments showed that distinct neuronal cells respond (fire) in the presence of edges of a specific orientation. For instance, some neurons fire on detecting horizontal edges, others on detecting diagonal edges, and still others on detecting vertical edges. Through these experiments, Hubel and Wiesel found that neurons are organized in a modular manner, and that all the modules working together are required for producing visual perception.
Ready to move beyond the theory and start building powerful computer vision models? upGrad’s expert-led programs can help you master Convolutional Neural Networks (CNNs) while enhancing your skills in AI, deep learning, and real-world application development.
This modular approach – the idea that specialized components inside a system have specific tasks – is what forms the basis of CNNs.
With that settled, let’s move on to how CNNs learn to perceive visual inputs.
Images are composed of individual pixels, each represented by a number between 0 and 255. A grayscale image is a single matrix of such values, while a color image has three channels (red, green, and blue). Any image you see can therefore be converted into a digital representation using these numbers – and that is how computers work with images.
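As a minimal sketch (using NumPy purely for illustration; the pixel values are hypothetical), this is what a tiny grayscale image looks like to a computer:

```python
import numpy as np

# A 3x3 grayscale "image": each entry is a brightness between 0 and 255
image = np.array([[  0, 128, 255],
                  [ 64, 192,  32],
                  [255,   0, 100]], dtype=np.uint8)

print(image.shape)  # (3, 3) - height x width; a color image would be (3, 3, 3)
```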
Here are the major operations that go into making a CNN learn image detection or classification. They will give you an idea of how learning takes place in CNNs.
Convolution can be understood mathematically as the integration of the product of two functions, measuring how one function modifies or influences the other. Here’s how it can be defined in mathematical terms:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
The purpose of convolution is to detect different visual features in the images, like lines, edges, colors, shadows, and more. This is a very useful property because once your CNN has learned the characteristics of a particular feature in the image, it can later recognize that feature in any other part of the image.
CNNs utilize kernels, or filters, to detect the different features present in an image. A kernel is just a small matrix of values (known as weights in the world of Artificial Neural Networks) trained to detect a specific feature. The filter moves over the entire image to check whether that feature is present, carrying out the convolution operation at each position to produce a value that represents how confident it is that the feature is there.
If a feature is present in the image, the result of the convolution operation is a large positive number. If the feature is absent, the convolution operation results in zero or a very small number.
Let’s understand this better using an example. In the image below, a filter has been trained to detect a plus sign, and is then passed over the original image. Since a part of the original image contains the same features the filter was trained for, the values in each cell where the feature exists are positive, and the convolution operation produces a large number.
However, when the same filter is passed over an image with a different set of features and edges, the output of the convolution operation is lower – implying there was no strong presence of a plus sign in the image.
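To make this concrete, here is a minimal sketch of the sliding-filter idea (NumPy for illustration; the plus-shaped kernel and the tiny image are hypothetical). Note that deep learning libraries actually compute cross-correlation – convolution without flipping the kernel – which is what we do here:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and
    return the feature map of responses (cross-correlation)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise multiply the current window by the kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 "plus sign" detector (hypothetical weights)
plus_kernel = np.array([[0, 1, 0],
                        [1, 1, 1],
                        [0, 1, 0]])

# A 5x5 image containing a plus sign at its centre
image = np.array([[0, 0, 0, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0]])

print(convolve2d(image, plus_kernel))
# The centre of the feature map holds the largest value (5.0),
# flagging exactly where the plus sign was found.
```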
So, in the case of complex images with various features like curves, edges, colours, and so on, we’ll need a large number of such feature detectors – one for each feature.
When a filter is passed over the image, a feature map is generated – the output matrix that stores the results of convolving that filter over different parts of the image. With many filters, we end up with a 3D output: one feature map per filter. Each filter must have the same number of channels as the input image for the convolution operation to take place.
Further, a filter can be slid over the input image at different intervals, using a stride value. The stride value tells the filter how many pixels to move at each step.
The spatial size of the output of a given convolutional layer can therefore be determined using the following formula, where $n$ is the input size, $f$ is the filter size, $p$ is the amount of padding (introduced below), and $s$ is the stride:

$$\text{output size} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$$
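For instance, a 7×7 input convolved with a 3×3 filter at stride 1 and no padding yields a 5×5 feature map. A tiny helper (hypothetical, just to illustrate the formula):

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution:
    floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 3))        # 5 -> 7x7 input, 3x3 filter
print(conv_output_size(7, 3, s=2))   # 3 -> stride 2 shrinks the output
print(conv_output_size(7, 3, p=1))   # 7 -> padding of 1 preserves the size
```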
One issue when working with convolutional layers is that pixels on the perimeter of the original image tend to be lost. Since the filters used are generally small, only a few pixels are lost per layer, but the loss adds up as we apply more convolutional layers, resulting in many lost pixels.
Padding addresses this by adding extra pixels around the image before a filter of the CNN processes it. Padding the image with zeroes gives the kernel more space to cover the entire image, including its border pixels, so the CNN's processing of the image is more accurate and complete.
Check the image above – padding has been done by adding additional zeroes at the boundary of the input image. This enables the network to capture all the distinct features without losing any pixels.
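As a quick sketch (NumPy again, hypothetical values), zero-padding simply surrounds the image with a border of zeroes:

```python
import numpy as np

image = np.array([[1, 2],
                  [3, 4]])

# Add a one-pixel border of zeroes on every side
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```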
The feature maps then need to be passed through a non-linear mapping function. A bias term is added to each feature map, and the result is passed through a non-linear activation function, typically ReLU (Rectified Linear Unit). This step introduces non-linearity into the CNN, which is essential because the images being detected and examined are themselves non-linear, being composed of many different objects.
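ReLU itself is a one-liner: it keeps positive values and zeroes out negative ones. A minimal NumPy illustration:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [ 0.3, -0.7]])
print(relu(feature_map))
# [[0.  1.5]
#  [0.3 0. ]]
```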
Once the activation phase is over, we move on to the pooling step, wherein the CNN down-samples the convolved features, saving processing time. Pooling also reduces the overall size of the representation, which helps prevent overfitting and other issues that arise when a Convolutional Neural Network is fed a lot of information – especially information that is not very relevant to classifying or detecting the image.
Pooling comes in two common types – max pooling and average pooling. In the former, a window is passed over the image according to a set stride value, and at each step the maximum value inside the window is pooled into the output matrix. In average pooling, the mean of the values inside the window is pooled instead.
The new matrix that’s formed as a result of the outputs is called a pooled feature map.
Compared with average pooling, one benefit of max pooling is that it allows the CNN to focus on the few neurons with the highest values rather than on all the neurons. This approach makes the network much less likely to overfit the training data and helps prediction and generalization go well.
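Here is a minimal max-pooling sketch (NumPy; the 2×2 window with stride 2, the most common configuration, is an assumption for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """2D max pooling over the feature map."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the largest value inside each window
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 1, 8]])
print(max_pool(fm))
# [[6. 4.]
#  [7. 9.]]  -> a 4x4 map reduced to 2x2
```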
After pooling is done, the 3D representation of the image is converted into a feature vector. This is then passed into a multi-layer perceptron to produce the output. Check out the image below to better understand the flattening operation:
As you can see, the rows of the matrix are concatenated into a single feature vector. If multiple feature maps are present, all of their rows are concatenated to form one longer flattened feature vector.
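In code, flattening is a single reshape (a sketch with hypothetical shapes):

```python
import numpy as np

# A stack of two 2x2 pooled feature maps: (channels, height, width)
pooled = np.array([[[1, 2],
                    [3, 4]],
                   [[5, 6],
                    [7, 8]]])

flattened = pooled.flatten()  # rows concatenated into one vector
print(flattened)              # [1 2 3 4 5 6 7 8]
```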
In this step, the flattened vector is fed to a neural network consisting of an input layer, one or more fully connected layers (FCLs), and a final output layer. The fully connected layers play the role of the hidden layers in an Artificial Neural Network, with every neuron connected to every neuron in the previous layer. The information passes through the entire network, a prediction error is calculated, and that error is then sent back through the network (backpropagation) to adjust the weights and improve the final output, making it more accurate.
The final outputs obtained from the above layer of the neural network don’t generally add up to one. These outputs need to be brought into the range [0, 1], so that they represent the probabilities of each class. For this, the Softmax function is used.
The output obtained from the dense layer is fed to the Softmax activation function, which maps all the final outputs to a vector whose elements sum to one.
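Concretely, softmax exponentiates each raw score and normalizes by the sum: $\sigma(z)_i = e^{z_i} / \sum_j e^{z_j}$. A minimal, numerically stable sketch (the class scores are hypothetical):

```python
import numpy as np

def softmax(z):
    """Map raw scores (logits) to probabilities that sum to 1.
    Subtracting max(z) first avoids overflow in exp()."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))        # ~[0.659 0.242 0.099]
print(softmax(logits).sum())  # 1.0
```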
The fully connected layer works by looking at the previous layer's output and determining which features correlate most strongly with a specific class. If the program is predicting whether an image contains a cat, the activation maps representing features like four legs, paws, and a tail will have high values; if it is predicting something else, different activation maps will light up. The fully connected layer learns which features strongly correlate with which classes, and weights them so that the computation between the weights and the previous layer's output yields accurate probabilities for the distinct output classes.
Here’s a quick summary of the entire process of how a CNN works and helps in computer vision:
1. Convolution – trained filters slide over the image and produce feature maps.
2. Activation (ReLU) – a non-linearity is applied to the feature maps.
3. Pooling – the feature maps are down-sampled to reduce size and overfitting.
4. Flattening – the pooled 3D representation is reshaped into a 1D feature vector.
5. Fully connected layers + Softmax – the vector is classified into class probabilities.
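Putting it all together, here is a hedged end-to-end sketch of a small image classifier (Keras is our choice of library for illustration; the layer sizes, 28×28 grayscale input, and 10-class output are assumptions, not a prescribed architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                # grayscale images
    layers.Conv2D(32, kernel_size=3, padding="same",
                  activation="relu"),              # convolution + ReLU
    layers.MaxPooling2D(pool_size=2),              # 28x28 -> 14x14
    layers.Conv2D(64, kernel_size=3, padding="same",
                  activation="relu"),
    layers.MaxPooling2D(pool_size=2),              # 14x14 -> 7x7
    layers.Flatten(),                              # 3D maps -> 1D vector
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dense(10, activation="softmax"),        # class probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```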
Understanding the Convolutional Neural Network (CNN) is your ticket to the exciting and innovative world of computer vision. It’s a field that is actively shaping the future of technology, attracting curious and talented minds from every possible background.
The best part is that the AI industry is incredibly welcoming and values practical skills over a specific degree. Take the foundational knowledge you've learned about how a Convolutional Neural Network (CNN) works, start experimenting, and you’ll be well on your way to a rewarding career.
If you wish to master the nitty-gritty of ML and AI, the ideal course of action would be to enroll in a professional AI/ML program. For instance, our Executive Programme in Machine Learning and AI is the perfect course for data science aspirants. The program covers subjects like statistics and exploratory data analytics, machine learning, and natural language processing. Also, it includes over 13 industry projects, 25+ live sessions, and 6 capstone projects. The best part about this course is that you get to interact with peers from across the world. It facilitates the exchange of ideas and helps learners build lasting connections with people from diverse backgrounds. Our 360-degree career assistance is just what you need to excel in your ML and AI journey!
A Convolutional Neural Network (CNN) is a specialized type of deep learning model designed to process and analyze visual data, like images and videos. Inspired by the human visual cortex, it automatically learns to detect patterns, features, and objects within an image, making it the powerhouse behind modern computer vision tasks.
A regular neural network treats input data as a flat vector, which loses the spatial relationships between pixels in an image. A Convolutional Neural Network (CNN), however, is designed to preserve these spatial hierarchies. It uses special layers (convolutional and pooling) to process the image in small chunks, allowing it to recognize features like edges, corners, and textures, and then combine them to identify more complex objects.
CNNs are effective because they use a concept called "parameter sharing." Instead of having every neuron connect to every pixel, a CNN uses a small filter (or kernel) that slides over the entire image to detect a specific feature (like a vertical edge). This single filter is reused across the whole image, which dramatically reduces the number of parameters the model needs to learn, making it much more efficient and scalable for visual data.
The Convolutional Layer is the core building block of the network. Its job is to apply a set of learnable filters to the input image. Each filter is a small matrix of weights that is trained to detect a specific feature, such as an edge, a color blob, or a texture. As the filter convolves (slides) over the image, it produces a "feature map" that highlights the locations where its specific feature was detected.
A feature map is the output of a convolutional layer after a filter has been applied to the input image. It's essentially a 2D map where each value indicates the strength or presence of the specific feature that the filter was designed to detect. For example, a filter trained to find horizontal lines will produce a feature map with high activation values in areas of the image where horizontal lines are present.
The Pooling Layer, also known as a downsampling layer, is used to reduce the spatial dimensions (width and height) of the feature maps. The most common type, Max Pooling, takes a small window of pixels and carries forward only the maximum value. This has two main benefits: it makes the model more computationally efficient and helps it become more robust to variations in the position of features in the image (a property called translation invariance).
ReLU stands for Rectified Linear Unit. It's an activation function that is applied after the convolutional layer. Its job is very simple: it takes an input value and returns the value if it's positive, and returns zero if it's negative (f(x) = max(0, x)). This is a crucial step in a Convolutional Neural Network in deep learning because it introduces non-linearity into the model, allowing it to learn much more complex patterns than it could with linear operations alone.
The Fully Connected Layer is typically located at the end of a Convolutional Neural Network (CNN) architecture. After the convolutional and pooling layers have extracted features from the image and reduced their dimensionality, the resulting feature maps are flattened into a one-dimensional vector. This vector is then fed into one or more Fully Connected Layers, which act like a traditional neural network to perform the final classification task (e.g., deciding if the image contains a cat or a dog).
Image classification is the task of assigning a label (or class) to an entire image, such as "Cat," "Dog," or "Car." A Convolutional Neural Network (CNN) performs this by first passing the image through a series of convolutional and pooling layers to learn a hierarchy of features. The final Fully Connected Layer then takes these high-level features and uses a softmax activation function to output a probability score for each possible class. The class with the highest probability is chosen as the final prediction.
Object detection is a more complex task than classification. It involves not only identifying what objects are in an image but also locating their position with a bounding box. Specialized architectures based on the Convolutional Neural Network (CNN), such as R-CNN, Fast R-CNN, and YOLO (You Only Look Once), are designed specifically for this purpose. They use the feature extraction power of a CNN to both classify objects and predict their coordinates.
Padding is the process of adding extra pixels (usually with a value of zero) around the border of an input image before applying a convolution. It serves two main purposes. First, it allows the filter to process the pixels at the very edge of the image more effectively. Second, it can be used to control the output size of the feature map, often to ensure that the output has the same width and height as the input (known as "same" padding).
Stride refers to the number of pixels the filter moves (or "slides") at a time as it convolves over the input image. A stride of 1 means the filter moves one pixel at a time. A stride of 2 means it moves two pixels at a time, which produces a smaller output feature map. Stride can be used as an alternative to pooling for downsampling the spatial dimensions of the data.
Yes, absolutely. While they were designed for 2D image data, the core concepts of a Convolutional Neural Network (CNN) can be adapted for other data types. 1D CNNs are highly effective for analyzing time-series data or text, where the model can detect patterns over a sequence. 3D CNNs are used for processing volumetric data, such as in medical scans (MRIs) or video analysis.
The features in the filters are not pre-programmed; they are learned automatically during the training process. The network starts with random values in its filters. As it processes training images and makes predictions, it uses an optimization algorithm (like backpropagation and gradient descent) to compare its prediction to the true label. It then slightly adjusts the values in its filters to make its next prediction more accurate. Over thousands of iterations, these filters evolve to become effective feature detectors.
Dropout is a regularization technique used to prevent overfitting in a neural network. During training, it randomly "drops out" (sets to zero) a certain percentage of neurons in a layer for each training step. This forces the network to learn more robust and redundant features, as it cannot become too reliant on any single neuron. It is a very effective way to improve the generalization of a Convolutional Neural Network in deep learning.
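A minimal sketch of the idea behind (inverted) dropout, with NumPy and a hypothetical 50% rate:

```python
import numpy as np

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: during training, zero out `rate` of the
    activations at random and scale the survivors so the expected
    magnitude is unchanged; do nothing at inference time."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.array([0.8, 1.2, 0.5, 2.0])
print(dropout(a))  # roughly half the entries zeroed, the rest doubled
```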
CNNs are used in a vast range of real-world applications. These include facial recognition systems for security, self-driving cars for object detection, medical imaging analysis for detecting diseases like cancer, agricultural technology for monitoring crop health, and in social media platforms for automatically tagging photos and filtering content.
Deep learning models, including CNNs, are data-hungry. To train a Convolutional Neural Network (CNN) from scratch to achieve high accuracy, you often need thousands or even millions of labeled images. However, for many practical applications, a technique called "transfer learning" is used. This involves taking a powerful, pre-trained CNN and fine-tuning it on a much smaller, specific dataset, which can achieve excellent results with far less data.
Transfer learning is a technique where a model that was pre-trained on a very large dataset (like ImageNet, which has millions of images) is used as the starting point for a new, different task. The early layers of a pre-trained Convolutional Neural Network (CNN) have already learned to detect generic features like edges and textures. By using these pre-trained layers and only retraining the final layers on your specific dataset, you can build a highly accurate model with much less data and computational resources.
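As a hedged sketch of that workflow (Keras with a MobileNetV2 backbone as one possible example; the 224×224 input and the 5-class head are hypothetical choices):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load a backbone pre-trained on ImageNet, without its classifier head
base = keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                      include_top=False,
                                      weights="imagenet")
base.trainable = False  # freeze the generic feature extractor

# Attach a small new head and train only it on the smaller dataset
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),  # hypothetical 5 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```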
The "depth" of a CNN refers to the number of layers it has. A shallow CNN might have only a few convolutional and pooling layers. A deep CNN, like ResNet or VGG, can have dozens or even hundreds of layers. Deeper networks are capable of learning a more complex and abstract hierarchy of features, which generally allows them to achieve higher accuracy on more challenging tasks, provided they are trained on enough data.
A Convolutional Neural Network in deep learning is primarily used for supervised learning tasks. This means it is trained on a large dataset of labeled examples, where each image has a corresponding correct label (e.g., an image of a cat is labeled "cat"). The network learns to map the input image to the correct output label. While there are some applications of CNNs in unsupervised contexts (like in autoencoders), their most common and powerful use is in supervised classification and detection tasks.