Why normalize inside the network
As training proceeds, the weights of earlier layers change, so the distribution of each layer's inputs keeps shifting, which can slow learning. Normalization layers rescale activations to a stable mean and variance, which lets you use higher learning rates and train deeper networks.
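The rescaling step can be written compactly. The symbols here are the conventional ones rather than anything fixed by these notes: S is the set of activations that share statistics, epsilon is a small constant for numerical stability, and gamma and beta are the learned scale and shift that most implementations apply afterwards.

```latex
\mu = \frac{1}{|S|} \sum_{i \in S} x_i, \qquad
\sigma^2 = \frac{1}{|S|} \sum_{i \in S} (x_i - \mu)^2

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
y_i = \gamma \, \hat{x}_i + \beta
```

The normalization variants differ only in how S is chosen.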
The four flavors
- Batch norm normalizes each feature (each channel, in a conv net) across the batch. It works well for convolutional nets but becomes unstable with tiny batches, because the per-batch mean and variance estimates get noisy.
- Layer norm normalizes across the features within one sample, so it is independent of batch size. It dominates transformers.
- Group norm splits channels into groups and normalizes within each group, per sample. It is a good middle ground for small batches.
- Instance norm normalizes each channel of each sample separately, over its spatial positions. It is popular in style transfer.
Choosing by axis
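The four flavors differ only in which axes the mean and variance are pooled over. The following is a minimal NumPy sketch, assuming image-style activations of shape (N, C, H, W). The helper names are illustrative, the learned scale and shift are omitted, and layer norm here pools over all of (C, H, W), whereas transformers usually normalize over the last feature dimension only.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x to zero mean and unit variance over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # One mean/variance per channel, pooled over batch and spatial positions.
    return normalize(x, (0, 2, 3), eps)

def layer_norm(x, eps=1e-5):
    # One mean/variance per sample, pooled over all of its features.
    return normalize(x, (1, 2, 3), eps)

def instance_norm(x, eps=1e-5):
    # One mean/variance per (sample, channel) pair, pooled over space only.
    return normalize(x, (2, 3), eps)

def group_norm(x, num_groups, eps=1e-5):
    # One mean/variance per (sample, group of channels).
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(grouped, (2, 3, 4), eps).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6, 5, 5))  # (N, C, H, W)
```

In this sketch, group norm with a single group reduces to layer norm, and group norm with one group per channel reduces to instance norm, which is one way to see why it sits between the two.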
Practical cautions
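- Batch norm behaves differently in training and inference: it uses the current batch's statistics during training and running averages of them at test time. Forgetting to switch modes is a classic bug.
- Layer norm computes its statistics per sample, so it needs no running statistics and behaves the same in training and inference. That independence from the rest of the batch is also why it suits variable-length sequences.
- Batch norm also has a mild regularizing effect, because each sample's output depends on the other samples in its mini-batch, which injects noise into the batch statistics.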
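The mode-switching caution is easier to see in a toy implementation. The sketch below assumes inputs of shape (batch, features) and an exponential moving average with a hypothetical momentum of 0.1. The class and attribute names are illustrative and are not any particular library's API, and the learned scale and shift are left out.

```python
import numpy as np

class BatchNorm1D:
    """Toy batch norm over (batch, features) inputs, without scale or shift."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, x):
        if self.training:
            # Training: normalize with this batch's own statistics...
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # ...and fold them into the running averages for later use.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the accumulated running statistics instead.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = BatchNorm1D(3)
for _ in range(200):
    bn(rng.normal(5.0, 2.0, size=(64, 3)))  # simulate training batches

single = np.full((1, 3), 5.0)  # one sample at test time

bn.training = False
eval_out = bn(single)   # normalized against the running statistics

bn.training = True      # the classic bug: still in training mode
train_out = bn(single)  # a batch of one is its own mean, so the output is all zeros
```

With a batch of one left in training mode, every input maps to zeros and the running statistics are silently overwritten, which is why the mode switch matters most for small inference batches.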
Key idea
Batch norm normalizes across the batch and suits large-batch conv nets; layer norm normalizes across features and suits transformers. Group norm is the safe choice when batches are small.