Why normalize
Normalization layers standardize activations so that their distribution stays stable during training, which tends to speed convergence and keep gradients well scaled.
Batch norm
Batch normalization standardizes each feature using the mean and variance computed over the current mini-batch, then applies a learnable scale and shift.
- It works well in convolutional vision models.
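- It behaves differently in training and at inference: training uses the statistics of the current mini-batch, while inference uses running averages accumulated during training.
- It is sensitive to small batch sizes, because statistics estimated from only a few examples are noisy.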
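The training and inference behavior described above can be sketched in plain Python. This is a minimal illustration, not any library's implementation: the function names, the `momentum` update rule, the `eps` value, and the use of the biased (divide by N) variance for the running estimate are all assumptions chosen for the example.

```python
import math

def batch_norm_train(batch, gamma, beta, running_mean, running_var,
                     momentum=0.1, eps=1e-5):
    """Normalize each feature over the batch and update running statistics.

    batch is a list of examples, each a list of feature values.
    The momentum convention here is an assumption for this sketch.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    new_mean, new_var = [], []
    for j in range(num_features):
        column = [row[j] for row in batch]              # feature j across the batch
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma[j] * (batch[i][j] - mean) * inv_std + beta[j]
        # Exponential moving average, used later at inference time.
        new_mean.append((1 - momentum) * running_mean[j] + momentum * mean)
        new_var.append((1 - momentum) * running_var[j] + momentum * var)
    return out, new_mean, new_var

def batch_norm_eval(batch, gamma, beta, running_mean, running_var, eps=1e-5):
    """Normalize with the stored running statistics instead of batch statistics."""
    return [
        [gamma[j] * (row[j] - running_mean[j]) / math.sqrt(running_var[j] + eps) + beta[j]
         for j in range(len(row))]
        for row in batch
    ]

# Two examples with two features each.
batch = [[1.0, 10.0], [3.0, 30.0]]
gamma, beta = [1.0, 1.0], [0.0, 0.0]
train_out, run_mean, run_var = batch_norm_train(
    batch, gamma, beta, running_mean=[0.0, 0.0], running_var=[1.0, 1.0])
# The same input row gives a different result at inference,
# because the running statistics differ from the batch statistics.
eval_out = batch_norm_eval([[1.0, 10.0]], gamma, beta, run_mean, run_var)
```

With only two examples in the batch, each feature's mean and variance rest on two numbers, which is the small-batch sensitivity noted in the list.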
Layer norm
Layer normalization standardizes across the features of a single example instead of across the batch.
- It does not depend on batch size or batch statistics.
- This makes it the usual choice in transformers and recurrent models.
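Layer normalization for a single example can be sketched the same way. The function name and the `eps` value are assumptions for illustration. Since the statistics come from one example's features, there are no running averages and no difference between training and inference.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one example across its features, then scale and shift.

    x, gamma, and beta are lists of the same length (one entry per feature).
    """
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

gamma, beta = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
alone = layer_norm([1.0, 2.0, 3.0], gamma, beta)

# Normalizing the same example as part of a larger batch gives the same
# result, because each example is processed independently of the others.
in_batch = [layer_norm(row, gamma, beta)
            for row in [[1.0, 2.0, 3.0], [100.0, -50.0, 7.0]]]
```

Here `alone` and `in_batch[0]` are identical, which shows the independence from batch size stated in the list.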
Key idea
Batch norm normalizes each feature over the batch, while layer norm normalizes each example over its features. Both then apply a learnable scale and shift, and both aim to stabilize training.
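The contrast can be written out for an input with N examples and D features. The notation is an assumption introduced here, not taken from the notes: x_{ij} is feature j of example i, gamma_j and beta_j are the learnable scale and shift, and epsilon is a small constant for numerical stability.

```latex
\[
\begin{aligned}
\text{Batch norm:}\quad
  & \mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
    \sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \mu_j\right)^2, \qquad
    y_{ij} = \gamma_j \, \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}} + \beta_j \\[6pt]
\text{Layer norm:}\quad
  & \mu_i = \frac{1}{D}\sum_{j=1}^{D} x_{ij}, \qquad
    \sigma_i^2 = \frac{1}{D}\sum_{j=1}^{D} \left(x_{ij} - \mu_i\right)^2, \qquad
    y_{ij} = \gamma_j \, \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta_j
\end{aligned}
\]
```

The only difference is the index being summed over: batch norm sums over examples i for a fixed feature j, and layer norm sums over features j for a fixed example i.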