Batch Normalization Revisited
Batch normalization standardizes the activations entering a layer using statistics computed across the current mini-batch, which tends to make training smoother and faster.
The mechanism
- Compute the mean and variance of each feature across the batch.
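- Subtract the mean and divide by the standard deviation to normalize. In practice a small constant is added to the variance before taking the square root, so the division stays numerically stable.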
- Apply a learned scale and shift so the layer can recover any needed range.
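The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the function name `batch_norm_train`, the list-of-rows input layout, and the default `eps=1e-5` are assumptions made for the example.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize each feature of `batch` using statistics of that batch.

    batch: list of rows, each row a list of feature values (hypothetical layout).
    gamma, beta: per-feature learned scale and shift.
    eps: small constant for numerical stability (assumed default).
    """
    n = len(batch)
    num_features = len(batch[0])

    # Step 1: per-feature mean and (biased) variance across the batch.
    mean = [sum(row[j] for row in batch) / n for j in range(num_features)]
    var = [
        sum((row[j] - mean[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]

    # Steps 2 and 3: normalize, then apply the learned scale and shift.
    out = [
        [
            gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
            for j in range(num_features)
        ]
        for row in batch
    ]
    return out, mean, var


# Two examples, two features on very different scales.
batch = [[1.0, 10.0], [3.0, 30.0]]
normalized, mean, var = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized)
```

With the scale set to 1 and the shift set to 0, both features end up on roughly the same range even though their raw values differ by a factor of ten.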
Why it helps
Normalizing keeps activations in a stable range as weights change, which reduces sensitivity to initialization and lets you use larger learning rates. It also adds a small amount of noise, because each mini-batch yields slightly different statistics, and this gives a mild regularizing effect. In practice, training tends to converge faster and more reliably.
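To get a feel for where that noise comes from, the following sketch measures how much the batch mean of a single feature varies from one random mini-batch to the next. It is an illustrative experiment under assumed conditions (standard normal data, a hypothetical helper named `batch_mean_spread`, 500 trials), not a measurement from a real network.

```python
import random
import statistics

def batch_mean_spread(batch_size, trials=500, seed=0):
    """Standard deviation of the batch mean over many random mini-batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Draw one mini-batch of a single feature from a standard normal.
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.mean(batch))
    return statistics.stdev(means)


for size in (4, 64):
    print(size, round(batch_mean_spread(size), 3))
```

The spread shrinks as the batch grows, so larger batches inject less noise into the normalization, while very small batches inject a lot.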
The batch dependence catch
Because the statistics come from the batch, behavior changes with batch size and differs between training and inference. At inference the layer uses running averages of the mean and variance collected during training, so its output no longer depends on the other examples in the batch. Very small batches make the estimates noisy, which is one reason layer normalization is preferred in some settings. Despite these quirks, batch norm remains a workhorse for convolutional vision models.
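The difference between training and inference can be made concrete with a small sketch. The class name `BatchNorm1D`, the exponential-moving-average update, and the `momentum=0.1` default are assumptions chosen for illustration. Real libraries differ in details, for example in whether the running variance uses the biased or unbiased estimate.

```python
import math

class BatchNorm1D:
    """Toy batch norm layer that tracks running statistics (illustrative only)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = [1.0] * num_features          # learned scale
        self.beta = [0.0] * num_features           # learned shift
        self.running_mean = [0.0] * num_features   # used at inference
        self.running_var = [1.0] * num_features    # used at inference
        self.momentum = momentum                   # assumed update rate
        self.eps = eps

    def __call__(self, batch, training):
        d = len(self.gamma)
        if training:
            n = len(batch)
            # Statistics of the current mini-batch.
            mean = [sum(row[j] for row in batch) / n for j in range(d)]
            var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n for j in range(d)]
            # Fold them into the running averages for later use at inference.
            m = self.momentum
            for j in range(d):
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean[j]
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var[j]
        else:
            # Inference: fixed statistics, independent of the current batch.
            mean, var = self.running_mean, self.running_var
        return [
            [
                self.gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + self.eps)
                + self.beta[j]
                for j in range(d)
            ]
            for row in batch
        ]


bn = BatchNorm1D(num_features=1)
bn([[0.0], [2.0]], training=True)       # updates the running statistics
print(bn([[1.0]], training=False))      # a single example works at inference
```

Note that a batch of one example is fine at inference, whereas in training mode it would have zero variance and every normalized value would collapse to the shift.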
Key idea
Batch norm standardizes activations using batch statistics, speeding convergence but tying behavior to batch size.