What is Batch Normalization?

By Rahul Singh

Updated on Jun 01, 2026 | 10 min read | 4.22K+ views

Batch normalization is a widely used technique in deep learning that helps neural networks train faster and more reliably. It works by normalizing the inputs passed between layers, ensuring that data maintains a stable distribution throughout the training process.

By keeping activations centered and scaled appropriately, batch normalization reduces issues such as vanishing and exploding gradients. This leads to smoother learning, faster convergence, and improved performance in deep neural networks.

In this blog, you'll learn what batch normalization is, why it matters, how it works mathematically, and where it is used.

What Is Batch Normalization?

Batch normalization is a technique that normalizes the output of each layer in a neural network before passing it to the next layer. It was introduced in a 2015 paper by Sergey Ioffe and Christian Szegedy at Google, and it quickly became one of the most widely used tools in deep learning.

The core idea is simple. During training, the distribution of inputs to each layer keeps shifting as the weights of the previous layers change. This is called internal covariate shift. Every time the weights update, the inputs seen by the next layer look slightly different. The layer has to keep adapting to this moving target, which slows down learning significantly.

Batch normalization solves this by rescaling the inputs to each layer so they consistently have a mean of zero and a standard deviation of one, using the statistics from the current mini-batch. Then it applies two learnable parameters to restore the model's ability to represent any distribution it needs.

The Problem It Was Designed to Solve

Before batch normalization in deep learning became standard, training deep networks required:

  • Very low learning rates to prevent divergence
  • Careful weight initialization to avoid vanishing or exploding gradients
  • Extensive regularization to prevent overfitting

With batch normalization, many of these concerns are reduced. Training becomes faster, more stable, and less sensitive to the initial setup.

Where It Sits in the Network

Batch normalization is applied after the linear transformation (the matrix multiplication) and before the activation function, although some implementations place it after the activation. The placement can vary by architecture, but the before-activation position is the one described in the original paper.

| Layer Type | Typical Position of Batch Normalization |
|---|---|
| Fully connected layer | After linear transform, before activation |
| Convolutional layer | After convolution, before activation |
| Recurrent layer | Less common; layer normalization preferred |

How Batch Normalization Works: The Math

Understanding what batch normalization does mathematically is essential. The process happens in four clear steps during training.

Step 1: Compute the Mini-Batch Mean

For a mini-batch of m samples, compute the mean of each feature:

μ_B = (1/m) * Σ x_i

This gives a single mean value per feature across all samples in the batch.

Step 2: Compute the Mini-Batch Variance

σ²_B = (1/m) * Σ (x_i - μ_B)²

This measures how spread out the values are for each feature in the batch.

Step 3: Normalize

x̂_i = (x_i - μ_B) / sqrt(σ²_B + ε)

The small constant ε (epsilon, usually 1e-5) is added to avoid division by zero when the variance is very small.

After this step, each feature in the batch has mean 0 and variance approximately 1 (slightly below 1 because of ε).
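As a worked example of Steps 1 to 3, the sketch below normalizes a single feature over a small mini-batch. The four batch values and the variable names are made up for illustration.

```python
import math

# Hypothetical mini-batch: one feature observed in m = 4 samples
batch = [2.0, 4.0, 6.0, 8.0]
eps = 1e-5
m = len(batch)

# Step 1: mini-batch mean
mu = sum(batch) / m                              # 5.0

# Step 2: mini-batch variance (biased estimate, divides by m)
var = sum((x - mu) ** 2 for x in batch) / m      # 5.0

# Step 3: normalize each value
x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]

# Check the result: mean is 0 and variance is just under 1 because of eps
norm_mean = sum(x_hat) / m
norm_var = sum(v ** 2 for v in x_hat) / m
print(x_hat)
print(norm_mean, norm_var)
```

The values are symmetric around the mean, so the normalized outputs come out as two mirrored pairs.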

Step 4: Scale and Shift

y_i = γ * x̂_i + β

Here, γ (gamma) and β (beta) are learnable parameters. These are what give the network flexibility. Without them, batch normalization would always force every layer's output to have mean 0 and variance 1, which could actually hurt the model's ability to represent complex functions. With gamma and beta, the network can learn whatever distribution is optimal for each layer.
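To see why gamma and beta preserve flexibility, the following sketch shows that one particular setting of the two parameters undoes the normalization completely. The batch shape and random data are assumptions for illustration; in a real network these values would be learned, not set by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(32, 4))   # hypothetical batch: 32 samples, 4 features
eps = 1e-5

mu = X.mean(axis=0)
var = X.var(axis=0)
X_hat = (X - mu) / np.sqrt(var + eps)

# With gamma = 1 and beta = 0 the layer outputs the plain normalized values
y_default = 1.0 * X_hat + 0.0

# If training pushed gamma toward sqrt(var + eps) and beta toward mu,
# the scale-and-shift step would recover the original activations
gamma = np.sqrt(var + eps)
beta = mu
y_identity = gamma * X_hat + beta

print(np.allclose(y_identity, X))
```

Any distribution between these two extremes is reachable as well, which is the sense in which the layer does not restrict what the network can represent.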

What Happens During Inference?

During inference, batch statistics are not used. Predictions are often made on a single sample, and the output for one input should not depend on whatever else happens to be in the batch. Instead, the layer uses running averages of the mean and variance collected during training. These are tracked using exponential moving averages and are frozen once training is complete.

| Phase | Mean and Variance Source |
|---|---|
| Training | Computed from current mini-batch |
| Inference | Stored running averages from training |
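The two phases can be sketched as a small class. This is a minimal illustration, not a framework API: the class name `RunningBatchNorm`, the momentum value, and the random training data are all assumptions, and real libraries differ in details such as whether the stored variance is the biased or unbiased estimate.

```python
import numpy as np

class RunningBatchNorm:
    """Minimal batch norm with running statistics (illustrative only)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, X, training):
        if training:
            mu = X.mean(axis=0)
            var = X.var(axis=0)
            # Exponential moving average: keep (1 - momentum) of the old value
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: stored statistics, independent of the current batch
            mu, var = self.running_mean, self.running_var
        X_hat = (X - mu) / np.sqrt(var + self.eps)
        return self.gamma * X_hat + self.beta


rng = np.random.default_rng(0)
bn = RunningBatchNorm(num_features=3)
for _ in range(200):
    bn.forward(rng.normal(loc=5.0, scale=2.0, size=(64, 3)), training=True)

# At inference, a sample gets the same output alone or inside a larger batch
sample = np.array([[5.0, 5.0, 5.0]])
out_single = bn.forward(sample, training=False)
out_in_batch = bn.forward(np.vstack([sample, rng.normal(size=(7, 3))]), training=False)[:1]
print(out_single, out_in_batch)
```

Because inference never touches the batch statistics, the prediction for a given input is deterministic once training ends.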

Implementing Batch Normalization in Python

Seeing the code makes the concept much more concrete. Here is how batch normalization is used in practice, both from scratch and with frameworks.

From Scratch in NumPy

This is a simple manual implementation to build intuition in Python:

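import numpy as np


def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """Training-mode forward pass for X of shape (batch_size, num_features)."""
    # Step 1: Compute batch mean (one value per feature)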
    mu = np.mean(X, axis=0)

    # Step 2: Compute batch variance
    var = np.var(X, axis=0)

    # Step 3: Normalize
    X_norm = (X - mu) / np.sqrt(var + eps)

    # Step 4: Scale and shift
    out = gamma * X_norm + beta

    return out, X_norm, mu, var

This captures the training-time forward pass; it does not track the running averages needed for inference. During backpropagation, gradients flow through gamma and beta to update them along with all other network weights.
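For completeness, here is a hedged sketch of the matching backward pass. The function name `batch_norm_backward` and the random test data are assumptions; the formula is the standard compact form obtained by differentiating the four forward steps, with the variance computed as the biased estimate (dividing by m).

```python
import numpy as np

def batch_norm_backward(dout, X_norm, var, gamma, eps=1e-5):
    """Gradients of the loss with respect to the input, gamma, and beta."""
    m = dout.shape[0]
    dgamma = np.sum(dout * X_norm, axis=0)   # gamma multiplies X_norm
    dbeta = np.sum(dout, axis=0)             # beta is added to every sample
    dX_norm = dout * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    # The mean and variance depend on every sample, hence the two sum terms
    dX = (inv_std / m) * (
        m * dX_norm
        - np.sum(dX_norm, axis=0)
        - X_norm * np.sum(dX_norm * X_norm, axis=0)
    )
    return dX, dgamma, dbeta


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
dout = rng.normal(size=(8, 3))   # stand-in for the upstream gradient

mu, var = X.mean(axis=0), X.var(axis=0)
X_norm = (X - mu) / np.sqrt(var + 1e-5)
dX, dgamma, dbeta = batch_norm_backward(dout, X_norm, var, gamma)
print(dX.shape, dgamma.shape, dbeta.shape)
```

In practice, frameworks compute these gradients automatically; writing them out mainly shows that each sample's gradient depends on the whole batch.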

Using PyTorch

In PyTorch, batch normalization is one line:

import torch
import torch.nn as nn

# For fully connected layers (1D input)
bn = nn.BatchNorm1d(num_features=128)

# For convolutional layers (2D feature maps)
bn_conv = nn.BatchNorm2d(num_features=64)

# Inside a model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)   # Normalize before activation
        x = self.relu(x)
        x = self.fc2(x)
        return x

Using TensorFlow/Keras

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(256),
    layers.BatchNormalization(),   # Added after Dense, before activation
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])

Key Parameters to Know

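| Parameter | What It Controls |
|---|---|
| momentum | Weight used when updating the running statistics. PyTorch defaults to 0.1 (weight on the new batch statistic); Keras defaults to 0.99 (weight on the old running average) |
| eps / epsilon | Small constant for numerical stability (PyTorch default 1e-5; Keras default 1e-3) |
| affine (PyTorch) / center, scale (Keras) | Whether beta and gamma are learnable (default: True) |
| training mode | Whether to use batch stats or running averages (model.train() / model.eval() in PyTorch; the training argument in Keras) |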
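One detail worth illustrating is that the momentum parameter follows opposite conventions in the two frameworks: PyTorch's default of 0.1 is the weight on the new batch statistic, while Keras's default of 0.99 is the weight kept on the old moving average. The helper functions below are hypothetical and only mimic the documented update rules.

```python
def pytorch_style_update(running, batch_stat, momentum=0.1):
    # PyTorch convention: momentum is the weight given to the NEW batch statistic
    return (1 - momentum) * running + momentum * batch_stat

def keras_style_update(moving, batch_stat, momentum=0.99):
    # Keras convention: momentum is the weight kept on the OLD moving average
    return momentum * moving + (1 - momentum) * batch_stat

running = 0.0
batch_mean = 10.0   # hypothetical batch statistic

print(pytorch_style_update(running, batch_mean))   # moves 10% of the way toward 10.0
print(keras_style_update(running, batch_mean))     # moves 1% of the way toward 10.0
```

Porting a model between frameworks therefore means converting the value (PyTorch momentum m corresponds to Keras momentum 1 - m), not copying it.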

Benefits of Batch Normalization in Deep Learning

Batch normalization in deep learning offers several concrete advantages that explain why it became a standard component of many modern architectures, particularly convolutional networks.

1. Faster Training

Because inputs to each layer are normalized, the optimizer does not have to compensate for varying input scales. This allows the use of higher learning rates without divergence. In practice, training often converges in significantly fewer epochs.

2. Reduces Dependence on Weight Initialization

Before batch normalization, poor weight initialization could lead to vanishing or exploding gradients, especially in very deep networks. Batch normalization reduces this sensitivity because it standardizes inputs at each layer, making the gradient flow more predictable regardless of how the weights started.
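A small experiment makes this concrete. In the sketch below, the same hypothetical weights are scaled by 0.1 and by 10, which changes the raw pre-activations by a factor of 100, yet the normalized outputs are nearly identical. The data, shapes, and helper name `batch_norm` are assumptions, and gamma and beta are left out to isolate the normalization step.

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    # Normalization only (gamma = 1, beta = 0) to isolate the effect
    return (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))   # hypothetical inputs
W = rng.normal(size=(10, 5))    # hypothetical weights

raw_small = X @ (0.1 * W)
raw_large = X @ (10.0 * W)

out_small = batch_norm(raw_small)
out_large = batch_norm(raw_large)

print(np.abs(raw_small - raw_large).max())   # large gap before normalization
print(np.abs(out_small - out_large).max())   # tiny gap after normalization
```

The small remaining difference comes from ε, which matters more when the variance itself is small.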

3. Acts as a Regularizer

Batch normalization introduces noise into the training process because the mean and variance are computed from a mini-batch, not the full dataset. This noise acts as a mild regularizer, similar to dropout. In many networks, adding batch normalization allows you to reduce dropout or remove it entirely.

4. Enables Deeper Networks

Without normalization, training networks deeper than 20 or 30 layers was very difficult. With batch normalization, architectures like ResNet (whose standard variants range from 18 to 152 layers) became trainable. Together with residual connections, it is a key reason why the deep learning field was able to scale to the architectures we have today.

Summary of Benefits

  • Higher learning rates become safe to use
  • Less need for meticulous weight initialization
  • Mild regularization effect reduces overfitting
  • Enables stable training in very deep architectures
  • Faster convergence in most cases

Batch Normalization vs. Other Normalization Techniques

Batch normalization is not the only normalization method. Several alternatives exist, each designed to handle cases where batch normalization falls short.

Why Alternatives Were Needed

Batch normalization has a known limitation: it struggles when the batch size is very small. With a batch size of 1 or 2, the mean and variance estimates are too noisy to be useful. It also performs poorly in recurrent neural networks, where the sequence length varies and the batch statistics are hard to track.
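The batch-size limitation can be illustrated by measuring how much the batch mean fluctuates from one mini-batch to the next. The population of activations below is synthetic, and the function name is hypothetical; the point is only that the spread shrinks roughly as one over the square root of the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical activations of one feature, with true mean 0 and variance 1
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def spread_of_batch_means(batch_size, num_batches=1000):
    # Draw many mini-batches and measure how far their means scatter
    batches = rng.choice(population, size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

for bs in (2, 8, 32, 128):
    print(bs, round(spread_of_batch_means(bs), 3))
```

With only two samples per batch, the estimated mean is off by a large fraction of a standard deviation on a typical step, which is the noise the layer then bakes into its output.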

Comparison Table

| Technique | Normalizes Over | Best Use Case |
|---|---|---|
| Batch Normalization | Batch dimension | CNNs, large-batch training |
| Layer Normalization | Feature dimension | Transformers, RNNs, NLP |
| Instance Normalization | Single sample, spatial dims | Style transfer, image generation |
| Group Normalization | Groups of channels | Small batch sizes, detection |

Layer Normalization

Layer normalization computes mean and variance over all features for a single sample, not across the batch. This makes it independent of batch size, which is why it became the standard in Transformer models and large language models. Models such as BERT and GPT use layer normalization in every Transformer block, not batch normalization.

Group Normalization

Group normalization divides channels into groups and normalizes within each group. It was motivated by tasks such as object detection and segmentation, where batch sizes are small due to memory constraints from high-resolution images. Because its statistics do not depend on the batch, it tends to outperform batch normalization when batch sizes drop below about 8.
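The techniques differ mainly in which axes the mean and variance are computed over. The sketch below applies one shared helper to a hypothetical activation tensor of shape (N, C, H, W); the helper name `normalize`, the tensor shape, and the group count are assumptions, and the learnable scale and shift are omitted. For Transformer inputs, layer normalization is usually taken over the last (feature) axis of each token instead.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    # Shared helper: subtract the mean and divide by the std over the given axes
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 5, 5))   # hypothetical activations: (N, C, H, W)
N, C, H, W = x.shape

batch_norm = normalize(x, axes=(0, 2, 3))      # per channel, across batch and space
layer_norm = normalize(x, axes=(1, 2, 3))      # per sample, across all features
instance_norm = normalize(x, axes=(2, 3))      # per sample and channel, across space

groups = 2                                     # hypothetical group count
grouped = x.reshape(N, groups, C // groups, H, W)
group_norm = normalize(grouped, axes=(2, 3, 4)).reshape(N, C, H, W)

print(batch_norm.shape, layer_norm.shape, instance_norm.shape, group_norm.shape)
```

Only the batch normalization line includes axis 0, which is why it is the only variant whose output for one sample changes with the rest of the batch.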

When to Use Batch Normalization

  • Use it for CNNs trained on image classification with standard batch sizes
  • Avoid it in recurrent networks; use layer normalization instead
  • Avoid it when batch size is consistently below 8; use group normalization
  • It is generally the default choice for feedforward and convolutional architectures

Conclusion

Batch normalization was one of the most important contributions to deep learning in the last decade. It addressed a real and fundamental problem: the instability of training deep networks when the distribution of inputs to each layer keeps changing. By normalizing activations at each layer using mini-batch statistics and introducing two learnable parameters to restore representational power, it made deep network training faster, more stable, and far more accessible.

If you are building or fine-tuning neural networks, batch normalization belongs in your toolkit. Know when to use it, know when to replace it, and know what it is actually doing to your data at each step.

Frequently Asked Questions (FAQs)

1. Why is batch normalization applied before the activation function?

Applying batch normalization before activation keeps inputs within a stable range. If applied after ReLU, negative values are already removed. Normalizing first improves gradient flow and usually helps the network learn faster and more effectively.

2. What happens if the batch size is 1 when using batch normalization?

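With a batch size of 1 in a fully connected layer, each feature's batch mean equals the sample itself and the variance becomes zero, so the normalized value is zero and the output collapses to beta regardless of the input. Epsilon provides numerical stability but does not make the statistics meaningful. (In convolutional layers the statistics still span the spatial positions, but they remain noisy.) In such cases, layer normalization or instance normalization is generally preferred.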
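The batch-size-1 case is easy to reproduce for a fully connected layer. The sample, gamma, and beta values below are made up for illustration.

```python
import numpy as np

x = np.array([[3.0, -7.0, 42.0]])    # hypothetical batch containing one sample
gamma = np.array([1.5, 0.5, 2.0])
beta = np.array([0.1, 0.2, 0.3])
eps = 1e-5

mu = x.mean(axis=0)                  # equals the sample itself
var = x.var(axis=0)                  # zero for every feature
x_hat = (x - mu) / np.sqrt(var + eps)
out = gamma * x_hat + beta

print(var)
print(out)   # matches beta: the input values no longer influence the output
```

Every input would produce the same output here, so the layer passes no information forward.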

3. Do gamma and beta get updated during backpropagation?

Yes. Gamma and beta are trainable parameters initialized to 1 and 0. During training, gradients flow through them, and the optimizer updates them alongside the network's weights and biases.

4. How does batch normalization affect training speed?

Batch normalization often speeds up training by maintaining stable input distributions across layers. This allows larger learning rates and reduces optimization difficulties. Many models converge in fewer epochs compared to those without batch normalization.

5. Is batch normalization the same as data normalization or feature scaling?

No. Data normalization is applied to input features before training. Batch normalization occurs inside the network during training and normalizes intermediate activations using mini-batch statistics. Both techniques serve different purposes and are often used together.

6. Can batch normalization be used in convolutional neural networks?

Yes. Batch normalization is widely used in CNNs. It normalizes activations across batches and spatial dimensions for each channel. Most modern image recognition architectures include batch normalization after convolutional layers.

7. Does batch normalization eliminate the need for dropout?

Not entirely. Batch normalization introduces a mild regularization effect through mini-batch statistics. Many models can reduce dropout usage, but when overfitting is significant, both batch normalization and dropout are often used together.

8. Why does batch normalization use running averages during inference?

During inference, predictions are often made one sample at a time. Since batch statistics are unavailable, the model uses running averages of the mean and variance collected during training for stable and consistent predictions.

9. What is the difference between batch normalization and weight normalization?

Batch normalization normalizes layer activations using batch statistics. Weight normalization modifies the weight vectors by separating magnitude and direction. It does not depend on batch size and can be useful in specific architectures.
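As a minimal sketch of the weight normalization idea, the weight vector is rebuilt from a direction parameter and a separate magnitude parameter. The names `v` and `g` and all values are illustrative; no batch statistics are involved at any point.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)     # direction parameter (hypothetical values)
g = 2.0                    # magnitude parameter

# Effective weight: unit vector along v, scaled to length g
w = g * v / np.linalg.norm(v)

x = rng.normal(size=5)     # a single input; no batch is needed
y = w @ x

print(np.linalg.norm(w))
print(y)
```

The norm of the effective weight is controlled entirely by `g`, which is why the method works the same for any batch size.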

10. How does batch normalization help with vanishing gradients?

Batch normalization keeps activations within a reasonable range across layers. This helps maintain stronger gradient signals during backpropagation, reducing the risk of vanishing gradients and making very deep networks easier to train.

11. Is batch normalization used in transformer models like GPT or BERT?

No. Transformer models use layer normalization instead of batch normalization. Layer normalization operates across feature dimensions for individual tokens, making it better suited for variable-length sequences used in NLP models.
