What is Batch Normalization?
By Rahul Singh
Updated on Jun 01, 2026 | 10 min read | 4.22K+ views
Batch normalization is a widely used technique in deep learning that helps neural networks train faster and more reliably. It works by normalizing the inputs passed between layers, ensuring that data maintains a stable distribution throughout the training process.
By keeping activations centered and scaled appropriately, batch normalization reduces issues such as vanishing and exploding gradients. This leads to smoother learning, faster convergence, and improved performance in deep neural networks.
In this blog, you'll learn what batch normalization is, why it matters, how it works mathematically, and where it is used.
Batch normalization is a technique that normalizes the output of each layer in a neural network before passing it to the next layer. It was introduced in a 2015 paper by Sergey Ioffe and Christian Szegedy at Google, and it quickly became one of the most widely used tools in deep learning.
The core idea is simple. During training, the distribution of inputs to each layer keeps shifting as the weights of the previous layers change. This is called internal covariate shift. Every time the weights update, the inputs seen by the next layer look slightly different. The layer has to keep adapting to this moving target, which slows down learning significantly.
Batch normalization solves this by rescaling the inputs to each layer so they consistently have a mean of zero and a standard deviation of one, using the statistics from the current mini-batch. Then it applies two learnable parameters to restore the model's ability to represent any distribution it needs.
Before batch normalization became standard in deep learning, training deep networks required careful weight initialization and conservative learning rates to avoid vanishing or exploding gradients.
With batch normalization, many of these concerns are reduced. Training becomes faster, more stable, and less sensitive to the initial setup.
Batch normalization is applied after the linear transformation (the matrix multiplication) and before the activation function, although some implementations place it after the activation. The placement can vary by architecture, but the before-activation position is the one described in the original paper.
| Layer Type | Typical Position of Batch Normalization |
| Fully connected layer | After linear transform, before activation |
| Convolutional layer | After convolution, before activation |
| Recurrent layer | Less common; layer normalization preferred |
Understanding what batch normalization does mathematically is essential. The process happens in four clear steps during training.
For a mini-batch of m samples, compute the mean of each feature:
μ_B = (1/m) * Σ x_i
This gives a single mean value per feature across all samples in the batch.
Next, compute the variance of each feature over the same mini-batch:
σ²_B = (1/m) * Σ (x_i - μ_B)²
This measures how spread out the values are for each feature in the batch.
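Then normalize each value using the batch mean and variance:
x̂_i = (x_i - μ_B) / sqrt(σ²_B + ε)
The small constant ε (epsilon, a small value such as 1e-5) is added to avoid division by zero when the variance is very small.
After this step, each feature in the batch has mean 0 and variance approximately 1 (slightly below 1 because of ε).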
Finally, scale and shift the normalized value:
y_i = γ * x̂_i + β
Here, γ (gamma) and β (beta) are learnable parameters. These are what give the network flexibility. Without them, batch normalization would always force every layer's output to have mean 0 and variance 1, which could actually hurt the model's ability to represent complex functions. With gamma and beta, the network can learn whatever distribution is optimal for each layer.
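As a rough illustration of the four steps, here is a small worked example in NumPy. The toy batch values are made up for illustration. It also shows one way to see why gamma and beta preserve flexibility: setting gamma to sqrt(σ²_B + ε) and beta to μ_B recovers the original activations.

```python
import numpy as np

# Toy mini-batch: m = 4 samples, 2 features (values chosen for illustration).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
eps = 1e-5

mu = X.mean(axis=0)                      # step 1: per-feature mean, [2.5, 25.0]
var = X.var(axis=0)                      # step 2: per-feature variance, [1.25, 125.0]
X_hat = (X - mu) / np.sqrt(var + eps)    # step 3: normalize

# Step 4 with the usual initial values gamma = 1, beta = 0 leaves X_hat unchanged.
gamma, beta = np.ones(2), np.zeros(2)
Y = gamma * X_hat + beta

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization,
# which shows the layer can still represent the original activations.
Y_identity = np.sqrt(var + eps) * X_hat + mu

print(X_hat.mean(axis=0))            # close to 0 for each feature
print(X_hat.var(axis=0))             # close to 1 for each feature
print(np.allclose(Y_identity, X))    # the original batch is recovered
```

In a real network the optimizer would not usually land on these exact values. The point is only that the identity mapping remains reachable.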
During inference, the input may be a single sample, and a prediction should not depend on which other samples happen to share its batch, so mini-batch statistics are not used. Instead, the layer uses running averages of the mean and variance collected during training. These are tracked using exponential moving averages and are frozen once training is complete.
| Phase | Mean and Variance Source |
| Training | Computed from current mini-batch |
| Inference | Stored running averages from training |
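The two phases in the table can be sketched in NumPy as follows. This is a simplified illustration, not a framework implementation: the function names `bn_train_step` and `bn_inference` are hypothetical, the update rule assumes a momentum that weights the new batch statistic, and frameworks may differ in details such as whether the running variance uses an unbiased estimate.

```python
import numpy as np

def bn_train_step(X, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Normalize with batch statistics and update the running averages."""
    mu, var = X.mean(axis=0), X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    # Exponential moving average: new = (1 - momentum) * old + momentum * batch
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return X_hat, running_mean, running_var

def bn_inference(X, running_mean, running_var, eps=1e-5):
    """Normalize with the frozen running statistics; works for a single sample."""
    return (X - running_mean) / np.sqrt(running_var + eps)

rng = np.random.default_rng(0)
running_mean, running_var = np.zeros(3), np.ones(3)
for _ in range(200):
    # Synthetic training batches with mean 5 and standard deviation 2.
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    _, running_mean, running_var = bn_train_step(batch, running_mean, running_var)

single = np.array([[5.0, 5.0, 5.0]])   # one sample sitting at the data mean
print(bn_inference(single, running_mean, running_var))  # values near 0
```

Because `bn_inference` never looks at other samples, the same input always produces the same output once training has finished.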
Seeing the code makes the concept much more concrete. Here is how batch normalization is used in practice, both from scratch and with frameworks.
This is a simple manual implementation to build intuition in Python:
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    # X has shape (batch_size, num_features); statistics are per feature
    # Step 1: Compute batch mean
    mu = np.mean(X, axis=0)
    # Step 2: Compute batch variance
    var = np.var(X, axis=0)
    # Step 3: Normalize
    X_norm = (X - mu) / np.sqrt(var + eps)
    # Step 4: Scale and shift
    out = gamma * X_norm + beta
    return out, X_norm, mu, var
This captures the training-time forward pass; it does not track the running averages needed for inference. During backpropagation, gradients flow through gamma and beta to update them along with all other network weights.
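To make the gradient flow through gamma and beta concrete, here is a minimal sketch of just those two gradients. The helper name `batch_norm_param_grads` and the toy loss are assumptions for illustration. The gradient with respect to the input X is more involved and is left out.

```python
import numpy as np

def batch_norm_param_grads(dout, X_norm):
    """Gradients of the loss w.r.t. gamma and beta, given dL/dout."""
    # out = gamma * X_norm + beta, summed over the batch dimension
    dgamma = np.sum(dout * X_norm, axis=0)
    dbeta = np.sum(dout, axis=0)
    return dgamma, dbeta

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
X_norm = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + 1e-5)
gamma, beta = np.ones(3), np.zeros(3)

# Toy loss (an assumption for illustration): L = 0.5 * sum(out ** 2), so dL/dout = out.
out = gamma * X_norm + beta
dgamma, dbeta = batch_norm_param_grads(out, X_norm)

# One plain gradient-descent step on the two parameters.
lr = 0.1
gamma_new = gamma - lr * dgamma
beta_new = beta - lr * dbeta
print(dgamma, dbeta)
```

Frameworks compute these gradients automatically, so this is only useful for building intuition or for checking a from-scratch implementation.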
In PyTorch, batch normalization is one line:
import torch
import torch.nn as nn

# For fully connected layers (1D input)
bn = nn.BatchNorm1d(num_features=128)

# For convolutional layers (2D feature maps)
bn_conv = nn.BatchNorm2d(num_features=64)

# Inside a model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)  # Normalize before activation
        x = self.relu(x)
        x = self.fc2(x)
        return x
In TensorFlow/Keras, the equivalent layer is layers.BatchNormalization:
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(256),
    layers.BatchNormalization(),  # Added after Dense, before activation
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])
| Parameter | What It Controls |
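| momentum | Controls how fast running statistics update (PyTorch default 0.1; Keras default 0.99, where the value weights the old average rather than the new batch) |
| eps / epsilon | Small constant for numerical stability (PyTorch default 1e-5; Keras default 1e-3) |
| affine (PyTorch) / center, scale (Keras) | Whether gamma and beta are learnable (default: True) |
| training | Whether to use batch stats or running averages (a call argument in Keras; set through model.train() / model.eval() in PyTorch) |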
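The momentum parameter is a common source of confusion because PyTorch and Keras define it in opposite directions. The sketch below writes out both update rules in plain Python. The function names are hypothetical; the formulas follow the conventions described in each framework's documentation.

```python
def ema_pytorch_style(running, batch_stat, momentum=0.1):
    # PyTorch convention: momentum weights the NEW batch statistic.
    return (1 - momentum) * running + momentum * batch_stat

def ema_keras_style(moving, batch_stat, momentum=0.99):
    # Keras convention: momentum weights the OLD moving average.
    return momentum * moving + (1 - momentum) * batch_stat

# The same update rate is momentum=0.1 in one convention and 0.9 in the other.
a = ema_pytorch_style(running=0.0, batch_stat=2.0, momentum=0.1)
b = ema_keras_style(moving=0.0, batch_stat=2.0, momentum=0.9)
print(a, b)  # equal up to floating-point rounding
```

When porting a model between frameworks, a momentum of m in one convention corresponds to 1 - m in the other.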
Batch normalization in deep learning offers several concrete advantages that explain why it became a standard component in nearly every modern architecture.
Because inputs to each layer are normalized, the optimizer does not have to compensate for varying input scales. This allows the use of higher learning rates without divergence. In practice, training often converges in significantly fewer epochs.
Before batch normalization, poor weight initialization could lead to vanishing or exploding gradients, especially in very deep networks. Batch normalization reduces this sensitivity because it standardizes inputs at each layer, making the gradient flow more predictable regardless of how the weights started.
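One way to see the reduced sensitivity to initialization is that normalizing a layer's pre-activations cancels out any overall scaling of its weights. The following NumPy sketch uses a hypothetical weight matrix `W` and omits gamma and beta for simplicity.

```python
import numpy as np

def normalize(Z, eps=1e-5):
    """Batch-normalize pre-activations Z (gamma=1, beta=0 for simplicity)."""
    return (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))    # a mini-batch of 64 samples, 10 features
W = rng.normal(size=(10, 5))     # hypothetical weight matrix

out_small = normalize(X @ (0.1 * W))     # weights initialized 10x smaller
out_base = normalize(X @ W)
out_large = normalize(X @ (10.0 * W))    # weights initialized 10x larger

# The raw pre-activations differ by a factor of 100 between the extremes,
# but the normalized outputs are nearly identical (up to the effect of eps).
print(np.abs(out_large - out_base).max())
print(np.abs(out_small - out_base).max())
```

This only covers a uniform rescaling of the weights. It does not mean that every poor initialization is harmless.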
Batch normalization introduces noise into the training process because the mean and variance are computed from a mini-batch, not the full dataset. This noise acts as a mild regularizer, similar to dropout. In many networks, adding batch normalization allows you to reduce dropout or remove it entirely.
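The noise described here can be observed directly: the same input sample gets a slightly different normalized value depending on which other samples land in its mini-batch. This is a small sketch on synthetic data, assuming a batch size of 32.

```python
import numpy as np

def normalize(X, eps=1e-5):
    """Normalize each feature with the statistics of the given batch."""
    return (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))
sample = data[:1]    # the sample we track

outputs = []
for _ in range(5):
    # Draw 31 random companions for the same tracked sample (batch size 32).
    idx = rng.choice(np.arange(1, 1000), size=31, replace=False)
    batch = np.vstack([sample, data[idx]])
    outputs.append(normalize(batch)[0, 0])

print(outputs)   # five slightly different values for the same input
```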
Without normalization, training networks deeper than 20 or 30 layers was very difficult. Batch normalization, together with residual connections, helped make architectures like ResNet (with variants ranging from 18 to 152 layers) trainable. It is a key reason why the deep learning field was able to scale to the architectures we have today.
Batch normalization is not the only normalization method. Several alternatives exist, each designed to handle cases where batch normalization falls short.
Batch normalization has a known limitation: it struggles when the batch size is very small. With a batch size of 1 or 2, the mean and variance estimates are too noisy to be useful. It also performs poorly in recurrent neural networks, where the sequence length varies and the batch statistics are hard to track.
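The small-batch problem can be illustrated by measuring how much the batch variance estimate fluctuates at different batch sizes. The sketch below uses synthetic data with a true variance of 1; the helper name `spread_of_variance_estimates` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)   # true variance is 1

def spread_of_variance_estimates(batch_size, trials=2000):
    """Std. dev. of the batch variance across many random mini-batches."""
    batches = rng.choice(data, size=(trials, batch_size))
    return batches.var(axis=1).std()

for bs in (2, 8, 64):
    # The spread shrinks as the batch size grows.
    print(bs, spread_of_variance_estimates(bs))
```

With only two samples per batch, the variance estimate swings widely from one batch to the next, which is the instability the paragraph above refers to.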
Comparison Table
| Technique | Normalizes Over | Best Use Case |
| Batch Normalization | Batch dimension | CNNs, large-batch training |
| Layer Normalization | Feature dimension | Transformers, RNNs, NLP models |
| Instance Normalization | Single sample, spatial dims | Style transfer, image generation |
| Group Normalization | Groups of channels | Small batch sizes, detection |
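The four techniques in the table differ mainly in which axes the mean and variance are computed over. Here is a rough NumPy sketch for an image-style tensor of shape (N, C, H, W). It omits the learnable scale and shift, the group count `G = 2` is an arbitrary choice, and the layer normalization line assumes normalization over all of a sample's features (in Transformers it is usually applied over the last feature dimension of each token).

```python
import numpy as np

def standardize(x, axes, eps=1e-5):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 5, 5))   # (N, C, H, W): batch, channels, height, width
N, C, H, W = x.shape

batch_norm = standardize(x, axes=(0, 2, 3))      # per channel, across the batch
layer_norm = standardize(x, axes=(1, 2, 3))      # per sample, across all features
instance_norm = standardize(x, axes=(2, 3))      # per sample and per channel

G = 2                                            # number of groups (must divide C)
grouped = x.reshape(N, G, C // G, H, W)
group_norm = standardize(grouped, axes=(2, 3, 4)).reshape(N, C, H, W)

print(batch_norm.shape, layer_norm.shape, instance_norm.shape, group_norm.shape)
```

Only the batch normalization line includes axis 0, which is why it is the only one of the four whose output for a sample depends on the rest of the batch.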
Layer normalization computes mean and variance over all features for a single sample, not across the batch. This makes it independent of batch size, which is why it became the standard in Transformer models and large language models. Models such as BERT and GPT use layer normalization in their Transformer blocks, not batch normalization.
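The batch-size independence of layer normalization can be checked with a small experiment: keep one sample fixed, swap out the rest of the batch, and compare the outputs. This is a minimal sketch without learnable parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one sample or token) over its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch dimension (training-mode statistics)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(4, 8))
batch_b = batch_a.copy()
batch_b[1:] = rng.normal(loc=3.0, size=(3, 8))   # same first sample, new companions

# Layer norm: the first sample's output is unchanged by its companions.
print(np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0]))
# Batch norm: the first sample's output depends on the rest of the batch.
print(np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0]))
```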
Group normalization divides channels into groups and normalizes within each group. It was motivated by tasks such as object detection and segmentation, where batch sizes are small due to memory constraints from high-resolution images. Because its statistics do not depend on the batch, its accuracy stays stable as the batch shrinks, and it tends to outperform batch normalization at very small batch sizes.
Batch normalization was one of the most important contributions to deep learning in the last decade. It addressed a real and fundamental problem: the instability of training deep networks when the distribution of inputs to each layer keeps changing. By normalizing activations at each layer using mini-batch statistics and introducing two learnable parameters to restore representational power, it made deep network training faster, more stable, and far more accessible.
If you are building or fine-tuning neural networks, batch normalization belongs in your toolkit. Know when to use it, know when to replace it, and know what it is actually doing to your data at each step.
Applying batch normalization before activation keeps inputs within a stable range. If applied after ReLU, negative values are already removed. Normalizing first improves gradient flow and usually helps the network learn faster and more effectively.
With a batch size of 1 in a fully connected layer, each feature's batch variance is exactly zero and the batch statistics lose meaning. Epsilon prevents division by zero, but the normalized value collapses to zero, so the output carries no information about the input. In such cases, layer normalization or instance normalization is generally preferred.
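The batch-size-1 case is easy to reproduce with a from-scratch forward pass. In this sketch (training-mode statistics, fully connected input, made-up values), two very different samples produce the same output, because the normalized value is zero and only beta remains.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """Training-mode batch norm over the batch dimension."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

gamma, beta = np.ones(3), np.array([0.5, -1.0, 2.0])

x1 = np.array([[10.0, -3.0, 7.0]])     # batch of one sample
x2 = np.array([[-4.0, 100.0, 0.0]])    # a completely different sample

# With one sample, mu equals the sample and var is 0, so the output is just beta.
print(batch_norm_forward(x1, gamma, beta))
print(batch_norm_forward(x2, gamma, beta))
```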
Gamma and beta are trainable parameters, typically initialized to 1 and 0 respectively. During training, gradients flow through them, and the optimizer updates them alongside the network's weights and biases.
Batch normalization often speeds up training by maintaining stable input distributions across layers. This allows larger learning rates and reduces optimization difficulties. Many models converge in fewer epochs compared to those without batch normalization.
Batch normalization is not the same as input data normalization. Data normalization is applied to input features before training. Batch normalization occurs inside the network during training and normalizes intermediate activations using mini-batch statistics. The two techniques serve different purposes and are often used together.
Batch normalization is widely used in CNNs. For each channel, it computes the mean and variance across the batch and spatial dimensions. Many modern image recognition architectures include batch normalization after convolutional layers.
Batch normalization does not entirely replace dropout. It introduces a mild regularization effect through mini-batch statistics. Many models can reduce dropout usage, but when overfitting is significant, both batch normalization and dropout are often used together.
During inference, predictions are often made one sample at a time. Since batch statistics are unavailable, the model uses running averages of the mean and variance collected during training for stable and consistent predictions.
Batch normalization normalizes layer activations using batch statistics. Weight normalization modifies the weight vectors by separating magnitude and direction. It does not depend on batch size and can be useful in specific architectures.
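For comparison, weight normalization can be sketched as a reparameterization of a single weight vector into a magnitude `g` and a direction `v`. The function name `weight_norm` and the example values are chosen for illustration. Note that no batch statistics appear anywhere.

```python
import numpy as np

def weight_norm(v, g):
    """Reparameterize a weight vector as w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])   # direction parameter (its own length, 5, is discarded)
g = 2.0                    # magnitude parameter

w = weight_norm(v, g)
print(w, np.linalg.norm(w))   # the norm of w equals g, up to rounding
```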
Batch normalization keeps activations within a reasonable range across layers. This helps maintain stronger gradient signals during backpropagation, reducing the risk of vanishing gradients and making very deep networks easier to train.
Transformer models typically use layer normalization instead of batch normalization. Layer normalization operates across feature dimensions for individual tokens, making it better suited for variable-length sequences used in NLP models.