Machine Learning

The Normalization Layers Compared

Batch, layer, group, and instance normalization and when each one fits.

5 min read · intro

Why normalize inside the network

As training proceeds, the distribution of each layer's inputs shifts, which slows learning. Normalization layers rescale activations to a stable mean and variance, which lets you use higher learning rates and train deeper networks.
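
In symbols, the rescaling is usually written as below. This is a sketch of the common formulation; the symbols are not from the lesson itself. Here \mu and \sigma^2 are the mean and variance over whichever set of activations the layer normalizes together, \epsilon is a small constant that guards against division by zero, and \gamma and \beta are a learnable scale and shift.

```latex
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta
```

The variants differ only in which activations contribute to \mu and \sigma^2.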

The four flavors

  • Batch norm normalizes each feature (each channel, in a conv net) across the samples in the batch. It is powerful for convolutional nets but unstable with tiny batches, because the batch statistics become noisy.
  • Layer norm normalizes across the features within one sample, so it ignores batch size. It dominates transformers.
  • Group norm splits the channels of each sample into groups and normalizes each group separately, a good middle ground for small batches.
  • Instance norm normalizes each channel of each sample separately, and is popular in style transfer.

Choosing by axis
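
One way to see the four flavors side by side is to ask which entries share a mean and variance. The sketch below is a minimal pure-Python illustration, not a library implementation. It assumes activations stored as a nested list x[n][c][l] (sample, channel, position), omits the learnable scale and shift, and uses hypothetical names (normalize_by, EPS) chosen only for this example.

```python
from collections import defaultdict
from statistics import fmean, pvariance

EPS = 1e-5  # assumed small constant to avoid division by zero


def normalize_by(x, key):
    """Normalize x[n][c][l]; entries whose key(n, c) matches share statistics."""
    buckets = defaultdict(list)
    for n, sample in enumerate(x):
        for c, channel in enumerate(sample):
            buckets[key(n, c)].extend(channel)
    stats = {k: (fmean(v), pvariance(v)) for k, v in buckets.items()}

    out = []
    for n, sample in enumerate(x):
        out.append([])
        for c, channel in enumerate(sample):
            mean, var = stats[key(n, c)]
            out[n].append([(v - mean) / (var + EPS) ** 0.5 for v in channel])
    return out


def batch_norm(x):
    # One set of statistics per channel, pooled over samples and positions.
    return normalize_by(x, lambda n, c: c)


def layer_norm(x):
    # One set of statistics per sample, pooled over channels and positions.
    return normalize_by(x, lambda n, c: n)


def instance_norm(x):
    # One set of statistics per (sample, channel) pair.
    return normalize_by(x, lambda n, c: (n, c))


def group_norm(x, num_groups):
    # One set of statistics per (sample, group of channels) pair.
    channels_per_group = len(x[0]) // num_groups
    return normalize_by(x, lambda n, c: (n, c // channels_per_group))


# Toy input: 2 samples, 4 channels, 2 positions each.
x = [
    [[1.0, 2.0], [3.0, 5.0], [0.0, 4.0], [7.0, 9.0]],
    [[2.0, 6.0], [1.0, 1.5], [8.0, 3.0], [4.0, 0.5]],
]

# Group norm spans the range between the two per-sample extremes.
same_as_instance = group_norm(x, num_groups=4) == instance_norm(x)
same_as_layer = group_norm(x, num_groups=1) == layer_norm(x)
```

Only batch_norm pools across the sample index n, which is why it alone depends on batch size. In a transformer, layer norm is typically applied per token over the feature dimension rather than over a whole sample as in this toy layout.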

Practical cautions

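  • Batch norm behaves differently in training and inference: it normalizes with the current batch's statistics during training and with running averages at test time. Forgetting to switch modes is a classic bug.
  • Layer norm needs no running stats and computes the same thing in training and inference, which is why it suits variable-length sequences.
  • Batch norm also has a mild regularizing effect, because each sample's output depends on the noisy statistics of whichever batch it lands in.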
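
The train-versus-inference difference can be sketched with a toy batch norm over a single scalar feature. This is an illustrative simplification under stated assumptions: the class name TinyBatchNorm, the momentum of 0.1, and the use of population variance for the running average are choices made for this example, and the learnable scale and shift are left out.

```python
from statistics import fmean, pvariance


class TinyBatchNorm:
    """Batch norm for one scalar feature, without learnable scale and shift."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Training: normalize with the current batch's own statistics
            # and fold them into the running averages.
            mean, var = fmean(batch), pvariance(batch)
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: normalize with the accumulated running statistics.
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in batch]


bn = TinyBatchNorm()
for _ in range(200):
    bn([8.0, 10.0, 12.0])  # training batches with mean 10

# Correct: switch to inference mode before predicting on one sample.
bn.training = False
correct = bn([13.0])  # about 1.84, measured against the running statistics

# The classic bug: still in training mode with a batch of one.
bn.training = True
buggy = bn([13.0])  # [0.0]: the sample is normalized against itself
```

In the buggy call every single-sample input maps to zero, and the call also pollutes the running averages with statistics from a batch of one.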

Key idea

Batch norm normalizes across the batch and suits large-batch conv nets; layer norm normalizes across features and suits transformers. Group norm is the safe choice when batches are small.

Check yourself

1. Why is layer norm preferred over batch norm in transformers?

2. What is a classic batch norm bug at inference time?