Layer Normalization

The idea

Layer normalization standardizes the activations of a single example across its feature dimension. For each token vector it subtracts the mean of that vector's features and divides by their standard deviation (with a small epsilon added for numerical stability), then applies a learned per-feature scale and shift.
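
The computation can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names layer_norm, gamma, beta and eps are chosen here for the example, and the input shape (batch, tokens, features) is an assumption.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector along its last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)      # one mean per token vector
    var = x.var(axis=-1, keepdims=True)        # population variance per token vector
    x_hat = (x - mean) / np.sqrt(var + eps)    # eps guards against division by zero
    return gamma * x_hat + beta                # per-feature scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))  # assumed (batch, tokens, features)
gamma = np.ones(8)    # learned in practice; fixed to the identity here
beta = np.zeros(8)
y = layer_norm(x, gamma, beta)
```

With gamma at one and beta at zero, every token vector in y has mean close to 0 and standard deviation close to 1, whatever the scale of the input.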

How it differs from batch norm

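Batch normalization computes statistics across the batch dimension, so each feature is normalized using other examples. That couples examples together, and because inference typically relies on running averages collected during training rather than on the current batch, the layer behaves differently at training and inference time.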

Layer norm uses statistics from one example only
It does not depend on batch size
It behaves the same during training and inference

These properties make it a natural fit for sequence models, where batch composition and sequence length vary.
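
The difference comes down to which axis the statistics are taken over. The sketch below is a simplified comparison: batch_norm_train is a hypothetical training-mode batch norm with no running averages or learned parameters. It places the same example in two different batches and checks whether its normalized output changes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis: one mean/variance per example.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Statistics over the batch axis: one mean/variance per feature.
    # Simplified: no running averages, no learned scale or shift.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
example = rng.normal(size=(1, 8))
# Same first example, different companions and different batch sizes.
batch_a = np.concatenate([example, rng.normal(size=(3, 8))], axis=0)
batch_b = np.concatenate([example, rng.normal(loc=5.0, size=(7, 8))], axis=0)

ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
```

Here ln_same is True and bn_same is False: the layer norm output for the shared example ignores its neighbors, while the batch norm output shifts with them.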

Why transformers use it

Transformers stack many attention and feed-forward blocks. Layer norm keeps the scale of activations steady as they pass through the stack, which stabilizes optimization and works well with residual connections.
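
A toy experiment can make the scale argument concrete. This is only a sketch under strong assumptions: random linear maps stand in for the attention and feed-forward sublayers, nothing is trained, and the width d and depth are arbitrary. It compares a residual stack with no normalization against one that applies layer norm after each residual addition.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d, depth = 64, 24                      # arbitrary width and number of blocks
x_plain = x_norm = rng.normal(size=(16, d))

for _ in range(depth):
    # Random linear map as a stand-in for a real sublayer (assumption).
    w = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
    x_plain = x_plain + x_plain @ w              # residual only
    x_norm = layer_norm(x_norm + x_norm @ w)     # residual, then layer norm

plain_scale = x_plain.std(axis=-1).mean()   # grows with depth
norm_scale = x_norm.std(axis=-1).mean()     # stays near 1
```

In the unnormalized stack each residual addition compounds the activation scale, so plain_scale ends up orders of magnitude above its starting value, while norm_scale remains close to 1 at every depth.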

Variants

Pre-norm places the normalization before each sublayer, inside the residual branch, and trains more stably for deep models
RMSNorm drops the mean subtraction and only rescales by the root mean square of the features, which is cheaper and now common in large language models
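
Both variants are small changes in code. The sketch below is illustrative only: rms_norm, pre_norm_block, post_norm_block and the linear sublayer are hypothetical names, and post_norm_block is included purely as a contrast, with the normalization applied after the residual addition.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # No mean subtraction: divide by the root mean square of the features.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def pre_norm_block(x, sublayer, norm):
    # Normalize the input to the sublayer; the residual path stays untouched.
    return x + sublayer(norm(x))

def post_norm_block(x, sublayer, norm):
    # Contrast: normalize after the residual addition.
    return norm(x + sublayer(x))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(loc=2.0, size=(4, d))
w = rng.normal(scale=d ** -0.5, size=(d, d))
gain = np.ones(d)                         # learned in practice

def sublayer(h):
    # Hypothetical stand-in for attention or a feed-forward layer.
    return h @ w

def norm(h):
    return rms_norm(h, gain)

y_rms = rms_norm(x, gain)
y_pre = pre_norm_block(x, sublayer, norm)
y_post = post_norm_block(x, sublayer, norm)
```

After rms_norm each vector has root mean square close to 1, but its mean is not forced to zero, which is exactly the step RMSNorm skips. In the pre-norm block the input x reaches the output unchanged through the residual path, whereas the post-norm block rescales it.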

Key idea