The Layer and Batch Norm

Why normalize

Normalization layers rescale activations to keep their distribution stable across training, which speeds convergence and steadies gradients.

Batch norm

Batch normalization standardizes each feature using the mean and variance over the current mini batch, then applies a learnable scale and shift.

It works well in convolutional vision models.
It behaves differently at training versus inference, where running statistics are used.
It is sensitive to small batch sizes.

Layer norm

Layer normalization standardizes across the features of a single example instead of across the batch.

It does not depend on batch size or batch statistics.
This makes it the default in transformers and recurrent models.

Key idea

Batch norm normalizes over the batch per feature while layer norm normalizes over the features per example, both adding a learnable scale and shift to stabilize training.

The Layer and Batch Norm

Why normalize

Batch norm

Layer norm

Key idea

Check yourself