Machine Learning

Batch Normalization Revisited

Normalizing activations across the batch to stabilize training.

Batch normalization standardizes the activations entering a layer using the mean and variance computed across the current mini-batch, which tends to make training smoother and faster.

The mechanism

  • Compute the mean and variance of each feature across the batch.
  • Subtract the mean and divide by the standard deviation (with a small constant added to the variance for numerical stability) to normalize.
  • Apply a learned scale and shift so the layer can recover any needed range.
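
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name batch_norm_forward, the (batch, features) input shape, and the eps value of 1e-5 are assumptions chosen for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (batch, features).

    gamma and beta are the learned scale and shift, one value per feature.
    eps is an assumed small constant that keeps the division stable.
    """
    # Step 1: per-feature mean and variance across the batch axis.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    # Step 2: subtract the mean and divide by the standard deviation.
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Step 3: learned scale and shift.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # toy activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gamma set to ones and beta to zeros, each feature column of out has mean close to 0 and standard deviation close to 1. During training, gamma and beta would be updated by gradient descent like any other parameter.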

Why it helps

Normalizing keeps activations in a stable range as weights change, which reduces sensitivity to initialization and lets you use larger learning rates. It also adds a small amount of noise, because each batch yields slightly different statistics, giving a mild regularizing effect. In practice, training tends to converge faster and more reliably.
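
The noise that batch statistics introduce can be observed directly. The sketch below draws many batches from the same standard normal distribution and measures how much the batch mean varies from one batch to the next. The helper name batch_mean_spread and the batch sizes are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_spread(batch_size, num_batches=2000):
    """Standard deviation of the batch mean across many batches from N(0, 1)."""
    batches = rng.normal(size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

spread_small = batch_mean_spread(2)   # very small batches
spread_large = batch_mean_spread(64)  # moderately sized batches
```

For independent samples with unit variance, the spread of the batch mean shrinks roughly as 1 / sqrt(batch_size), so spread_small comes out several times larger than spread_large. That per-batch jitter is the source of the regularizing effect.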

The batch dependence catch

Because statistics come from the batch, behavior changes with batch size and differs between training and inference. At inference the layer uses running averages of the mean and variance collected during training, so the output for one example no longer depends on the other examples in the batch. Very small batches make the estimates noisy, which is one reason layer normalization, which computes its statistics per example rather than per batch, is preferred in some settings. Despite these quirks, batch norm remains a workhorse for convolutional vision models.
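
The split between training and inference can be sketched as follows. This is a simplified illustration under stated assumptions: the class name SimpleBatchNorm, the exponential moving average with momentum 0.1, and the initial running values are choices made for the example, and real frameworks differ in details.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch norm layer that tracks running statistics (illustrative only)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)   # learned scale (fixed here)
        self.beta = np.zeros(num_features)   # learned shift (fixed here)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum             # assumed update rate
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Training: use statistics of the current batch and
            # fold them into the running averages.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the stored running averages, not the batch.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=3)
for _ in range(200):
    bn(rng.normal(loc=5.0, scale=2.0, size=(32, 3)), training=True)

single = np.array([[5.0, 5.0, 5.0]])
eval_out = bn(single, training=False)  # works even with a batch of one
```

After the training loop, the running mean and variance settle near the statistics of the data (mean about 5, variance about 4), so a single example can be normalized at inference time. In training mode a batch of one would be degenerate: the batch mean equals the example itself, so the normalized output collapses to beta.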

Key idea

Batch norm standardizes activations using batch statistics, speeding convergence but tying behavior to batch size.

Check yourself

1. What statistics does batch norm use to normalize?

2. What does batch norm use at inference time?

3. Why can very small batches hurt batch norm?