The idea
Layer normalization standardizes the activations of a single example across its feature dimension. For each token vector, it subtracts the mean of that vector's features, divides by their standard deviation (with a small constant added for numerical stability), and then applies a learned per-feature scale and shift.
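The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names `layer_norm`, `gamma`, `beta`, and `eps`, the default `eps=1e-5`, and the `(batch, tokens, features)` layout are all assumptions chosen for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector along the last (feature) axis.

    gamma and beta stand in for the learned scale and shift;
    eps is a small constant that guards against division by zero.
    """
    mean = x.mean(axis=-1, keepdims=True)    # one mean per token vector
    var = x.var(axis=-1, keepdims=True)      # biased variance per token vector
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardized features
    return gamma * x_hat + beta              # per-feature scale and shift

rng = np.random.default_rng(0)
# Assumed layout: (batch, tokens, features)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))
gamma = np.ones(8)   # identity scale, as at a typical initialization
beta = np.zeros(8)   # zero shift
y = layer_norm(x, gamma, beta)
```

With `gamma` at one and `beta` at zero, every token vector in `y` has approximately zero mean and unit standard deviation across its features, regardless of the offset and scale of the input.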
How it differs from batch norm
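Batch normalization computes statistics across the batch dimension, so each feature is normalized using values from other examples. That couples the examples in a batch together, and the layer behaves differently at training time, when it uses the current batch's statistics, and at inference time, when it uses running averages collected during training.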
- Layer norm uses statistics from one example only
- It does not depend on batch size
- It behaves the same during training and inference
These properties make it a natural fit for sequence models, where batch composition and sequence length vary.
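The batch independence claimed above is easy to check numerically. The sketch below places the same example in two different batches and compares its normalized output under each scheme. The function names are hypothetical, the learned scale and shift are omitted for brevity, and `batch_norm_train` models only training-mode batch statistics, not the running averages used at inference.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis: one mean/variance per example
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Statistics over the batch axis: one mean/variance per feature
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
example = rng.normal(size=(1, 6))

# The same example placed in two batches of different size and content
batch_a = np.concatenate([example, rng.normal(size=(3, 6))])
batch_b = np.concatenate([example, rng.normal(loc=5.0, size=(7, 6))])

# Compare the output for row 0 (the shared example) across the two batches
ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
```

Here `ln_same` is true because layer norm never looks beyond the example's own features, while `bn_same` is false because the batch statistics change with the other examples in the batch.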
Why transformers use it
Transformers stack many attention and feed-forward blocks. Layer norm keeps the scale of activations steady as they pass through this deep stack, which stabilizes optimization and works well with residual connections.
Variants
- Pre-norm places the normalization before each sublayer, inside the residual branch, and trains more stably for deep models
- RMSNorm drops the mean subtraction and only rescales by the root mean square of the features, which is cheaper and now common in large language models
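Both variants can be sketched briefly. The code below is an illustration under stated assumptions: `rms_norm`, `pre_norm_block`, and `post_norm_block` are hypothetical names, the sublayer is a stand-in linear map rather than real attention or feed-forward code, and the post-norm arrangement (normalizing after the residual addition) is included only as a point of contrast.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Rescale by the root mean square of the features; no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def pre_norm_block(x, sublayer, norm):
    # Normalize inside the residual branch; the skip path carries x unchanged
    return x + sublayer(norm(x))

def post_norm_block(x, sublayer, norm):
    # Contrast case: normalize after the residual addition
    return norm(x + sublayer(x))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))             # (tokens, features)
gamma = np.ones(d)                      # stands in for the learned scale
W = rng.normal(scale=0.1, size=(d, d))  # stand-in weights for a sublayer

def sublayer(h):
    # Hypothetical sublayer: a single linear map
    return h @ W

def norm(h):
    return rms_norm(h, gamma)

y_pre = pre_norm_block(x, sublayer, norm)
y_post = post_norm_block(x, sublayer, norm)
```

In the pre-norm block the input reaches the output through an unnormalized skip path, so `y_pre - x` is exactly the sublayer's contribution. In the post-norm block the sum itself is rescaled, so every row of `y_post` has a root mean square close to one.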
Key idea
Layer normalization standardizes each example across its features, giving batch-independent stability that suits transformers.