The Normalization Layers Compared

Why normalize inside the network

As training proceeds, the weights of earlier layers keep changing, so the distribution of each layer's inputs shifts and learning slows. Normalization layers standardize activations to roughly zero mean and unit variance, which lets you use higher learning rates and train deeper networks.
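
As a sketch of the rescaling step in its usual textbook form: the symbols below (the small constant \epsilon and the learned scale \gamma and shift \beta) are standard notation but are not named in the text above, so treat them as assumptions. The mean \mu and variance \sigma^2 are computed over whichever axes the particular layer normalizes.

```latex
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}},
\qquad
y = \gamma \, \hat{x} + \beta
```

The \epsilon term only guards against division by zero; \gamma and \beta let the network recover a different scale and offset if that helps.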

The four flavors

Batch norm normalizes each feature across the batch (and, in conv nets, across spatial positions). It is powerful for convolutional nets but unstable with tiny batches, where the batch statistics become noisy.
Layer norm normalizes across features within one sample, so it does not depend on batch size. It dominates transformers.
Group norm splits channels into groups and normalizes each group within one sample, a good middle ground for small batches.
Instance norm normalizes each sample and channel separately over its spatial positions. It is popular in style transfer.
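
The four flavors above differ mainly in which axes the mean and variance are taken over. The following is a minimal NumPy sketch of that idea, assuming image-like inputs of shape (N, C, H, W). The function names are illustrative rather than any library's API, and the learned scale and shift are left out. In transformers, layer norm is usually applied over the last feature axis only; here it reduces over everything except the batch axis.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x to zero mean and unit variance over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # One statistic per channel: reduce over batch and spatial axes.
    return normalize(x, (0, 2, 3), eps)

def layer_norm(x, eps=1e-5):
    # One statistic per sample: reduce over channel and spatial axes.
    return normalize(x, (1, 2, 3), eps)

def instance_norm(x, eps=1e-5):
    # One statistic per (sample, channel): reduce over spatial axes only.
    return normalize(x, (2, 3), eps)

def group_norm(x, num_groups, eps=1e-5):
    # One statistic per (sample, group): split the channels into groups first.
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(grouped, (2, 3, 4), eps).reshape(n, c, h, w)

# Toy input: 8 samples, 6 channels, 4x4 spatial grid (values are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 6, 4, 4))
```

Under these definitions, group norm with one group reduces over the same axes as layer norm, and group norm with one group per channel matches instance norm, which is one way to see why it sits between the two.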

Choosing by axis

Practical cautions

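Batch norm behaves differently in training versus inference: it normalizes with the current batch's statistics during training and with running averages of those statistics at test time. Forgetting to switch modes is a classic bug.
Layer norm computes its statistics per sample, so it needs no running stats and behaves the same in both modes. This is why it suits variable-length sequences.
Batch norm also has a mild regularizing effect, which comes from the noise in its batch statistics.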
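
To make the train-versus-inference difference concrete, here is a toy sketch of batch norm over (N, features) inputs. The class name `ToyBatchNorm`, the momentum value of 0.1, and the exponential-moving-average update are assumptions chosen for illustration, not a specific library's implementation, and the learned scale and shift are omitted.

```python
import numpy as np

class ToyBatchNorm:
    """Hypothetical batch norm over (N, features) inputs, for illustration only."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum  # assumed value; controls how fast averages update
        self.eps = eps
        self.training = True

    def __call__(self, x):
        if self.training:
            # Training: use this batch's statistics and update the running averages.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the stored running statistics, ignore the batch.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = ToyBatchNorm(num_features=3)
for _ in range(200):
    bn(rng.normal(loc=5.0, scale=2.0, size=(32, 3)))  # data with mean 5, std 2

single = np.full((1, 3), 9.0)  # one sample, two standard deviations above the mean

bn.training = False
eval_out = bn(single)   # normalized against the running statistics

bn.training = True
train_out = bn(single)  # batch of one: the batch mean equals the sample itself
```

In inference mode the sample comes out at roughly two standard deviations above zero, as expected. Left in training mode, a batch of one is normalized against itself and collapses to zero, which is the kind of silent failure the mode-switching bug produces.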

Key idea

Batch norm normalizes across the batch and suits large-batch conv nets; layer norm normalizes across features and suits transformers. Group norm is the safe choice when batches are small.

Check yourself