
Machine Learning

Layer Norm and Batch Norm

Normalization stabilizes activations to speed and steady training.

Why normalize

Normalization layers rescale activations to keep their distribution stable across training, which speeds convergence and steadies gradients.

Batch norm

Batch normalization standardizes each feature to zero mean and unit variance using the mean and variance computed over the current mini-batch, then applies a learnable scale and shift.

  • It works well in convolutional vision models.
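  • It behaves differently in training and at inference: training uses the statistics of the current mini-batch, while inference uses running averages accumulated during training.
  • It is sensitive to small batch sizes, because a mean and variance estimated from only a few examples are noisy.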
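The training and inference behavior described above can be sketched in NumPy. This is a minimal illustration, not a library implementation: the function name `batch_norm`, the `(batch, features)` input layout, the `momentum` update convention, and the `eps` constant are all assumptions made for this example.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.1, eps=1e-5):
    """Batch norm sketch for x of shape (batch, features).

    Returns the output and the (possibly updated) running statistics.
    The momentum convention here is an assumption of this sketch.
    """
    if training:
        # Statistics per feature, computed over the batch axis.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Exponential moving averages kept for use at inference.
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: ignore the current batch, use stored statistics.
        mean, var = running_mean, running_var

    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-feature scale (gamma) and shift (beta).
    return gamma * x_hat + beta, running_mean, running_var
```

With `training=True` each output feature has roughly zero mean and unit variance across the batch; with `training=False` the output depends only on the stored running statistics, so a single example can be processed on its own.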

Layer norm

Layer normalization standardizes across the features of a single example instead of across the batch.

  • It does not depend on batch size or batch statistics.
  • This makes it the default in transformers and recurrent models.
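A comparable sketch for layer norm, again in NumPy. The function name `layer_norm`, the choice to normalize over the last axis, and the `eps` value are assumptions of this example rather than a fixed API.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm sketch: normalize each example over its last axis."""
    # Statistics per example, computed over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-feature scale (gamma) and shift (beta).
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))            # 4 examples, 6 features
gamma, beta = np.ones(6), np.zeros(6)

full = layer_norm(x, gamma, beta)      # whole batch at once
single = layer_norm(x[:1], gamma, beta)  # first example alone
```

Because no statistic crosses the batch axis, `single[0]` matches `full[0]`: an example is normalized the same way whether it arrives alone or in a batch, and no separate inference mode is needed.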

Key idea

Batch norm normalizes each feature over the examples in the batch, while layer norm normalizes each example over its own features. Both then apply a learnable scale and shift, so the network can still represent whatever activation scale it needs.
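The same contrast can be written as formulas. The notation is an assumption introduced here: $x_{ij}$ is feature $j$ of example $i$ in a batch of $B$ examples with $D$ features each, $\epsilon$ is a small constant for numerical stability, and $\gamma_j$, $\beta_j$ are the learnable scale and shift.

```latex
% Batch norm: statistics per feature j, taken over the batch index i
\mu^{\mathrm{BN}}_j = \frac{1}{B}\sum_{i=1}^{B} x_{ij},
\qquad
\bigl(\sigma^{\mathrm{BN}}_j\bigr)^2 = \frac{1}{B}\sum_{i=1}^{B}\bigl(x_{ij}-\mu^{\mathrm{BN}}_j\bigr)^2,
\qquad
\hat{x}_{ij} = \frac{x_{ij}-\mu^{\mathrm{BN}}_j}{\sqrt{\bigl(\sigma^{\mathrm{BN}}_j\bigr)^2+\epsilon}}

% Layer norm: statistics per example i, taken over the feature index j
\mu^{\mathrm{LN}}_i = \frac{1}{D}\sum_{j=1}^{D} x_{ij},
\qquad
\bigl(\sigma^{\mathrm{LN}}_i\bigr)^2 = \frac{1}{D}\sum_{j=1}^{D}\bigl(x_{ij}-\mu^{\mathrm{LN}}_i\bigr)^2,
\qquad
\hat{x}_{ij} = \frac{x_{ij}-\mu^{\mathrm{LN}}_i}{\sqrt{\bigl(\sigma^{\mathrm{LN}}_i\bigr)^2+\epsilon}}

% Both: learnable per-feature scale and shift
y_{ij} = \gamma_j\,\hat{x}_{ij} + \beta_j
```

The only difference between the two is which index the sums run over; the final scale-and-shift step is identical.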

Check yourself

1. Over what does batch normalization compute statistics?

2. Why is layer norm preferred in transformers?

3. What learnable terms do both norms add?