Now that you’ve seen an overview of basic CNN architecture, let’s explore the five layers of CNN architecture in detail.
Comprehensive Overview of the 5 Key Layers in CNN Architecture
The convolutional layer, pooling layer, fully connected layer, dropout layer, and activation functions work together in CNNs to extract features and classify data efficiently. Here’s a breakdown of all five layers in CNN architecture.
Convolutional Layer
This layer performs a convolution operation: it applies filters to the input image to generate a feature map. It detects patterns such as edges or textures while preserving the spatial relationships between pixels, which is what allows it to identify local features.
The convolution is computed between the image and a filter of size MxM. The filter slides over the input image, and at each position the dot product is taken between the filter and the MxM patch of the image it currently covers.
The output is called the feature map, which gives us information about the image such as the corners and edges. This information is then fed to other layers to learn several other features of the input image.
Example: Let’s consider an example of identifying whether an image contains a cat. This layer detects the cat's whiskers or ears in the image.
Pooling Layer
The pooling layer reduces the dimensions of the feature map while retaining its key features. Max Pooling (selecting the maximum value), Average Pooling (average value), and Sum Pooling (sum of values) are the common types used. Pooling summarizes the features generated by a convolution layer.
In Max Pooling, the largest element in each section of the feature map is taken. Average Pooling calculates the average of the elements in each section, and Sum Pooling calculates their total. The Pooling Layer acts as a bridge between the Convolutional Layer and the FC Layer.
By generalizing the features extracted by the convolution layer, the pooling layer helps the network recognize a feature even when its position shifts slightly, and it also reduces the chances of overfitting.
Example: For detecting a cat in an image, the pooling layer simplifies the whiskers' feature by summarising their presence in a smaller region.
Fully Connected Layer
This layer connects every neuron in one layer to every neuron in the next. Flattening converts the multi-dimensional feature maps into a one-dimensional vector. Because the network learns how to combine features on its own, these layers reduce the need for hand-crafted classification rules.
In this layer, the feature maps from the previous layers are flattened and fed to the FC layer. The flattened vector then passes through one or more FC layers, each of which computes weighted sums followed by an activation function. At this stage, the classification process begins. Two FC layers are often stacked because they can model more complex feature combinations than a single one.
Example: For identifying a cat in an image, this layer checks if the detected features collectively represent a cat.
Dropout Layer
The dropout layer randomly deactivates a fraction of neurons during training to prevent overfitting. Overfitting occurs when a model fits the training data so closely that it performs poorly on new data. Dropout helps the model generalize better to unseen data.
During each training step, the outputs of the selected neurons are temporarily set to zero. The model's size does not change, and all neurons are active again at inference time. This reduces the model's dependency on specific neurons.
Example: For the same example, 30% of neurons in a layer are turned off during each training iteration.
Activation Functions
The activation function introduces non-linearity, which helps the network learn complex relationships in the data. It determines which signals are passed forward to the next layer and which are suppressed.
ReLU, Softmax, and Sigmoid are common activation functions, and each has a specific usage. ReLU is typically used in hidden layers. At the output, sigmoid (or a two-class softmax) is typically used for binary classification, while softmax is used for multi-class classification.
Example: ReLU makes sure that the model focuses only on meaningful features like a cat’s distinct patterns.
1. Convolutional Layer
The convolutional layer is crucial in CNNs (Convolutional Neural Networks) for extracting features from input images. It applies filters (kernels) to detect basic patterns, like edges, corners, and textures, while preserving the spatial relationships between pixels.
This feature extraction helps the network understand visual content, making it foundational for both CNN in machine learning and CNN in deep learning.
Key Concepts:
- Kernel (Filter): A small matrix (e.g., 3x3 or 5x5) that slides across the input image, detecting specific features. For example, a kernel might highlight vertical edges:
Vertical Edge Detection Kernel:
[[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]]
- Stride: Stride defines how many pixels the filter moves at each step. A stride of 1 means the filter moves one pixel at a time. Larger strides reduce the size of the output feature map.
- Padding: Padding helps control the spatial dimensions of the output. Same padding adds a border of zeros so that, at a stride of 1, the output has the same height and width as the input, while valid padding adds no border and therefore reduces the output size.
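To make the effect of stride and padding concrete, here is a small sketch of the output-size arithmetic along one spatial dimension. The helper name conv_output_size is hypothetical, and the "same" rule assumes the convention used by Keras (output = ceil(input / stride)).

```python
def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution.

    Hypothetical helper for illustration; follows the Keras-style
    'valid' / 'same' conventions.
    """
    if padding == "same":
        # 'same': output = ceil(input / stride), independent of kernel size
        return -(-input_size // stride)
    # 'valid': no padding, the kernel must fit entirely inside the input
    return (input_size - kernel_size) // stride + 1


print(conv_output_size(64, 3, stride=1, padding="valid"))  # 62
print(conv_output_size(64, 3, stride=1, padding="same"))   # 64
print(conv_output_size(64, 3, stride=2, padding="valid"))  # 31
```

With a stride of 1, valid padding trims one pixel from each side for a 3x3 kernel, while a stride of 2 roughly halves the output.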
How it works:
- The kernel slides over the image based on the stride.
- For each position, the filter performs element-wise multiplication with the image’s pixel values.
- The results are summed to produce a single value in the feature map.
- This process is repeated to create the full feature map, highlighting features such as edges or textures.
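The steps above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration (single channel, no padding) rather than how a framework implements it; the function name conv2d and the toy image are assumptions made for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np


def conv2d(image, kernel, stride=1):
    """Slide `kernel` over a 2-D `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Patch of the image currently covered by the kernel
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            # Element-wise multiplication, then sum to a single value
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


# Toy 5x5 image: dark on the left (0), bright on the right (1)
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# The vertical edge detection kernel shown earlier
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = conv2d(image, kernel)
print(feature_map)  # each row is [3, 3, 0]: strong response where dark meets bright
```

The feature map is 3x3 because a 3x3 kernel fits into a 5x5 image at three positions per axis. The response is high where the window straddles the dark-to-bright boundary and zero where the window is uniformly bright.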
Example:
In a CNN designed to detect a cat in an image, the first convolutional layer typically detects simple features such as the edges and lines that outline the cat’s ears or whiskers. Later layers combine these features to identify more complex patterns, like the shape of the cat’s face.
Code Example (TensorFlow - Keras):
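from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
# Build a model whose first layer is a 2-D convolution
model = Sequential()
# 32 filters of size 3x3 with ReLU, applied to 64x64 RGB images
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))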
- 32 is the number of filters.
- (3, 3) is the kernel size.
- ReLU is the activation function.
- input_shape defines the input image shape.
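As a quick sanity check on the layer above, the following arithmetic sketch works out its output shape and number of trainable parameters. It assumes the Keras defaults for Conv2D (stride of 1, valid padding, one bias per filter); the variable names are illustrative only.

```python
# Values taken from the Conv2D layer above
kernel_h, kernel_w = 3, 3
in_height, in_width, in_channels = 64, 64, 3
filters = 32

# Each filter spans the full input depth, plus one bias per filter
weights = kernel_h * kernel_w * in_channels * filters
biases = filters
total_params = weights + biases

# Valid padding with stride 1: each spatial dimension shrinks by (kernel - 1)
output_shape = (in_height - kernel_h + 1, in_width - kernel_w + 1, filters)

print(total_params)   # 896
print(output_shape)   # (62, 62, 32)
```

Each of the 32 filters produces its own 62x62 feature map, which is why the number of filters becomes the depth of the output.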
Why it matters:
The convolutional layer is the foundation of CNNs in machine learning and deep learning. It’s essential for image classification, object detection, and many other tasks that rely on identifying patterns in visual data.
The ability to extract meaningful features from images is what makes CNNs powerful for tasks like face recognition or medical image analysis.
2. Pooling Layer
The pooling layer reduces the spatial dimensions (height and width) of the feature map while preserving the essential features. This helps lower computational costs and reduces the risk of overfitting by summarizing the features extracted in the convolutional layer.
Key Concepts:
- Max Pooling: Selects the maximum value from a region of the feature map. This helps retain the most significant feature.
- Average Pooling: Takes the average value from a region of the feature map, providing a smooth approximation of the data.
- Sum Pooling: Sums up the values in a region of the feature map, though it’s less common than max or average pooling.
How it works:
The pooling layer divides the feature map into smaller sections, typically 2x2 or 3x3, and applies one of the pooling methods (max, average, or sum) to each section. This reduces the size of the feature map while retaining the most important information for further processing.
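The three pooling methods can be compared side by side with a short NumPy sketch. It assumes non-overlapping windows (stride equal to the window size) on a single-channel feature map; the function name pool2d and the sample values are hypothetical.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows (stride == size)."""
    h, w = feature_map.shape
    # Trim any rows/columns that do not fill a complete window
    h, w = h - h % size, w - w % size
    # Reshape so that axes 1 and 3 index the positions inside each window
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    reducers = {"max": np.max, "average": np.mean, "sum": np.sum}
    return reducers[mode](windows, axis=(1, 3))


feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 0, 8]], dtype=float)

print(pool2d(feature_map, mode="max"))      # window maxima: 6, 4, 7, 9
print(pool2d(feature_map, mode="average"))  # window means: 3.75, 2.25, 4.0, 4.5
print(pool2d(feature_map, mode="sum"))      # window sums: 15, 9, 16, 18
```

In every case the 4x4 feature map shrinks to 2x2, so the next layer processes a quarter of the values.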
Example:
If the CNN is detecting a cat in an image, the pooling layer simplifies the whisker features by summarizing them into a smaller region, reducing the resolution but keeping the critical information. This helps the network focus on the most prominent features, like the shape of the whiskers, rather than detailed pixel-level information.
Why it matters:
The pooling layer acts as a bridge between the convolutional and fully connected layers, making the network more efficient. By reducing the feature map size, it helps the network generalize learned features better, leading to improved performance and reduced overfitting in tasks like image recognition.
3. Fully Connected Layer
The fully connected (FC) layer connects every neuron in one layer to every neuron in the next. It combines all extracted features to make final decisions. After convolution and pooling, feature maps are flattened into a one-dimensional vector. This vector is then passed through one or more FC layers, typically used for classification tasks.
Key Concepts:
- Flattening: Converts multi-dimensional feature maps into a one-dimensional vector, making it suitable for processing in the fully connected layers.
- Mathematical Operations: Each neuron in the FC layer performs weighted sums and applies an activation function (e.g., ReLU or softmax).
- Why Multiple FC Layers: Having two or more FC layers allows the network to learn more complex patterns and improve classification accuracy.
How it works:
The output of the convolutional and pooling layers is flattened into a vector. This vector is then passed through the fully connected layers, where the network learns to combine features and make final predictions, such as classification.
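The flatten-then-classify flow can be sketched in NumPy as follows. The feature-map shape, layer sizes, and random weights are all assumptions for illustration; in a real network the weights are learned during training, so the probabilities printed here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 4x4 spatial grid, 8 channels
feature_maps = rng.random((4, 4, 8))

# Flattening: 4 * 4 * 8 = 128 values in a one-dimensional vector
flat = feature_maps.reshape(-1)


def dense(x, weights, bias):
    """Fully connected layer: every input contributes to every output."""
    return x @ weights + bias


def softmax(z):
    exps = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exps / exps.sum()


# Untrained example weights: 128 -> 16 hidden units -> 2 classes
w1, b1 = rng.normal(scale=0.1, size=(128, 16)), np.zeros(16)
w2, b2 = rng.normal(scale=0.1, size=(16, 2)), np.zeros(2)

hidden = np.maximum(0, dense(flat, w1, b1))  # first FC layer with ReLU
probs = softmax(dense(hidden, w2, b2))       # second FC layer with softmax

print(flat.shape)  # (128,)
print(probs)       # two class probabilities that sum to 1
```

The two outputs could be read as "cat" and "not cat" scores once the weights have been trained.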
Example:
For a CNN designed to detect a cat, the fully connected layer checks if the combination of extracted features (like whiskers, ears, and eyes) collectively represent a cat. The output could be a probability value for the cat class.
Why it matters:
The fully connected layer is crucial for classification in CNNs. It integrates the learned features and makes the final decision about the image, playing a key role in tasks like object detection and classification.
4. Dropout Layer
The Dropout Layer randomly deactivates a fraction of neurons during training to avoid overfitting. This prevents the model from relying too much on specific neurons and forces it to learn redundant, robust features, which helps it generalize better to unseen data.
Key Concepts:
- Random Deactivation: During training, a random subset of neurons is turned off (set to zero), preventing the model from relying too heavily on any particular neuron.
- Overfitting Prevention: By disabling certain neurons, the model is forced to learn redundant, robust features that work across different data inputs.
- Fraction of Neurons: Typically, 20-50% of neurons are dropped out during each training iteration.
How it works:
During each training iteration, a certain percentage of neurons (e.g., 30%) are "dropped" or turned off randomly. This helps reduce the model's reliance on specific features, improving its ability to generalize and perform well on new, unseen data.
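A minimal NumPy sketch of this behavior is shown below. It assumes the "inverted dropout" variant, in which the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at inference time; the function name and arguments are illustrative.

```python
import numpy as np


def dropout(activations, rate=0.3, training=True, rng=None):
    """Inverted dropout: zero a random fraction `rate` of values during training."""
    if not training or rate == 0.0:
        # At inference time every neuron stays active
        return activations
    if rng is None:
        rng = np.random.default_rng()
    keep_prob = 1.0 - rate
    # True where the neuron is kept, False where it is dropped
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the kept values so the expected activation is unchanged
    return activations * mask / keep_prob


rng = np.random.default_rng(42)
activations = np.ones(10_000)

dropped = dropout(activations, rate=0.3, rng=rng)
print((dropped == 0).mean())  # close to 0.3: about 30% of neurons are turned off
print(dropped.mean())         # close to 1.0: the average activation is preserved

unchanged = dropout(activations, rate=0.3, training=False)
print(np.array_equal(unchanged, activations))  # True
```

Because a fresh mask is drawn on every call, a different subset of neurons is dropped at each training iteration.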
Example:
In the case of identifying a cat in an image, during training, 30% of the neurons in a layer are turned off. This helps prevent the model from becoming overly reliant on specific features like the cat’s ears or whiskers, ensuring better performance on unseen images.
Why it matters:
The dropout layer is essential for improving generalization and reducing overfitting. It's particularly useful in deep learning models like CNNs, where the risk of overfitting is higher due to the large number of parameters being learned.
5. Activation Functions
The activation function introduces non-linearity into the model, enabling it to capture complex relationships in the data. It helps the network decide which information should be passed forward and which should be ignored, influencing the flow of computation.
Key Concepts:
- Non-linearity: Activation functions allow the network to learn complex patterns and make conditional decisions, something a simple linear function cannot achieve.
- Types of Activation Functions:
  - ReLU (Rectified Linear Unit): Often used for hidden layers. It outputs zero for negative values and the input for positive values.
  - Sigmoid: Outputs values between 0 and 1, ideal for binary classification.
  - Softmax: Converts the raw output of the model into a probability distribution for multi-class classification.
How it works:
Each neuron in the network applies an activation function to the weighted sum of its inputs. This determines whether the neuron should "fire" and pass information to the next layer. For instance, ReLU only allows positive values to pass through, which helps the network focus on significant features.
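The three activation functions named above are short enough to write out directly. The following NumPy sketch uses their standard definitions; the input vector is an arbitrary example.

```python
import numpy as np


def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Squashes each value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Turns a vector of raw scores into probabilities that sum to 1."""
    exps = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exps / np.sum(exps)


z = np.array([-2.0, 0.0, 3.0])  # example weighted sums from three neurons

print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # each value lies between 0 and 1; sigmoid(0) is 0.5
print(softmax(z))  # non-negative values that sum to 1, largest for the input 3.0
```

Note that sigmoid treats each value independently, while softmax normalizes across the whole vector, which is why it suits multi-class outputs.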
Example:
In a CNN tasked with detecting a cat in an image, ReLU sets negative filter responses to zero, so only positive responses (e.g., where a filter matches one of the cat's distinct patterns) are passed forward. This helps the model focus on the features its filters actually detected.
Why it matters:
Activation functions are crucial for enabling neural networks to learn non-linear patterns, which is essential for tasks like image classification, speech recognition, and more. They determine the decision-making process of each neuron, making them fundamental in deep learning models like CNNs.
Now that you’ve explored the layers in CNN architecture, let’s understand how ReLU functions in CNN.
ReLU: A Key Activation Function in Convolutional Neural Networks
ReLU (Rectified Linear Unit) is the most widely used activation function in CNNs. It introduces non-linearity in CNN, thus allowing the network to learn and model complex patterns efficiently.
Here’s how ReLU introduces non-linearity in a CNN.
- ReLU replaces all negative values in the input with zero while keeping positive values unchanged.
- This non-linear transformation process allows the network to learn and model complex, non-linear patterns in the data.
- Without non-linearity, any stack of layers would collapse into a single linear transformation, so the network would behave like a linear regression model no matter how deep it is. This would limit its ability to solve real-world problems.
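The last point can be checked numerically. In this small NumPy sketch (random example weights, biases omitted for brevity), two stacked linear layers give exactly the same result as one layer whose weight matrix is the product of the two, while inserting ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))    # 5 example inputs with 4 features each
w1 = rng.normal(size=(4, 6))   # first layer weights
w2 = rng.normal(size=(6, 3))   # second layer weights

# Two linear layers applied one after the other
two_linear = (x @ w1) @ w2

# A single linear layer with the combined weight matrix
one_linear = x @ (w1 @ w2)
print(np.allclose(two_linear, one_linear))  # True: the extra layer adds nothing

# The same two layers with ReLU in between
with_relu = np.maximum(0, x @ w1) @ w2
print(np.allclose(with_relu, one_linear))   # False: no single matrix reproduces this
```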
Here’s how ReLU affects the way the model learns these patterns.
- Focus on Relevant Features: ReLU sets negative activations to zero, so the network passes forward only the patterns its filters responded to.
- Prevents Saturation: Unlike activation functions such as tanh or Sigmoid, ReLU doesn’t saturate for large positive values. This allows better gradient flow during training.
- Improves Computational Efficiency: ReLU is a simple threshold at zero, which is cheaper to compute than the exponentials used in Sigmoid or tanh and helps speed up training.
- Supports Deep Architectures: ReLU’s effectiveness in passing gradients helps mitigate the vanishing gradient problem, making it suitable for deep networks.
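The saturation and gradient-flow points can be illustrated by comparing the derivatives of the three functions at increasingly large positive inputs. This is a small NumPy sketch using the standard derivative formulas; the sample inputs are arbitrary.

```python
import numpy as np


def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)         # never larger than 0.25


def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2


def relu_grad(z):
    return (z > 0).astype(float)  # 1 for positive inputs, 0 otherwise


z = np.array([0.5, 5.0, 10.0])

print(sigmoid_grad(z))  # shrinks toward 0 as the input grows (saturation)
print(tanh_grad(z))     # shrinks toward 0 even faster
print(relu_grad(z))     # [1. 1. 1.]
```

During backpropagation these derivatives are multiplied layer by layer, so values well below 1 shrink the gradient quickly in a deep network, while ReLU's derivative of 1 for positive inputs leaves it intact.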
Now that you understand ReLU and its role in enhancing CNN’s capabilities, let’s take a closer look at LeNet-5.
LeNet-5: A Key Type of CNN in Neural Network History
LeNet-5, introduced by Yann LeCun in 1998, was one of the first convolutional neural networks designed for handwritten digit recognition. Its architecture laid the groundwork for later models like AlexNet and VGG, which perform better on larger datasets thanks to deeper networks and improved computational power.
While LeNet-5 demonstrated CNN’s ability to recognize images with limited resources, it struggles with modern, complex tasks, which are better handled by deeper models like AlexNet and VGG.
Here’s an in-depth breakdown of the seven layers in the LeNet-5 architecture.