Machine Learning

Layer Normalization

Normalizing each example across its features to stabilize transformer training.

The idea

Layer normalization standardizes the activations of a single example across its feature dimension. For each token vector it subtracts the mean of that vector's features and divides by their standard deviation (with a small constant added for numerical stability), then applies a learned per-feature scale and shift.
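
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names `layer_norm`, `gamma`, `beta`, and `eps`, the value of `eps`, and the toy input shape are all assumptions chosen for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector along its last (feature) axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)      # one mean per token vector
    var = x.var(axis=-1, keepdims=True)        # one (biased) variance per token vector
    x_hat = (x - mean) / np.sqrt(var + eps)    # standardized features
    return gamma * x_hat + beta                # learned per-feature scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # 4 token vectors, 8 features each
gamma = np.ones(8)   # scale, initialized to 1 (assumed initialization)
beta = np.zeros(8)   # shift, initialized to 0 (assumed initialization)

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # each entry is close to 0
print(y.std(axis=-1))   # each entry is close to 1
```

With `gamma` at one and `beta` at zero the output is simply the standardized input; during training those two vectors would be updated like any other parameters.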

How it differs from batch norm

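Batch normalization computes its statistics across the batch dimension, so each feature is normalized using values from other examples. That couples the examples in a batch together, and the layer behaves differently at training time (batch statistics) than at inference time (running averages collected during training).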

  • Layer norm uses statistics from one example only
  • It does not depend on batch size
  • It behaves the same during training and inference

These properties make it a natural fit for sequence models, where batch composition and sequence length vary.
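
A small experiment makes the difference concrete. The sketch below puts the same example into two different batches and normalizes along either the feature axis (layer-norm style) or the batch axis (batch-norm style, training-mode statistics only). The helper `normalize` and the toy data are assumptions for illustration; learned parameters and running averages are left out.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Standardize x along the given axis (no learned parameters)."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 6))              # 4 examples, 6 features
other = rng.normal(size=(4, 6)) * 5 + 10     # a batch with very different statistics
other[0] = batch[0]                          # same first example, different batch mates

# Layer-norm style: statistics over the features of each example.
ln_a = normalize(batch, axis=-1)[0]
ln_b = normalize(other, axis=-1)[0]

# Batch-norm style: statistics over the examples for each feature.
bn_a = normalize(batch, axis=0)[0]
bn_b = normalize(other, axis=0)[0]

print(np.allclose(ln_a, ln_b))  # True: the result ignores the rest of the batch
print(np.allclose(bn_a, bn_b))  # False: the result depends on the batch mates
```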

Why transformers use it

Transformers stack many attention and feed-forward blocks. Layer norm keeps the scale of activations steady from one block to the next, which stabilizes optimization and works well with residual connections.
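
The effect on activation scale can be seen in a toy forward pass. The sketch below is only an illustration under strong assumptions: each "block" is a single random linear map with a residual connection rather than real attention or feed-forward layers, and the depth, width, and names (`standardize`, `rms`) are made up for the example. It shows the forward scale only, not optimization behavior.

```python
import numpy as np

def standardize(x, eps=1e-5):
    """Layer norm without learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms(x):
    """Root mean square over all entries, used as a scale measure."""
    return float(np.sqrt(np.mean(x ** 2)))

rng = np.random.default_rng(0)
depth, width = 24, 64
x0 = rng.normal(size=(16, width))
weights = [rng.normal(scale=width ** -0.5, size=(width, width)) for _ in range(depth)]

plain = x0
normed = x0
for w in weights:
    plain = plain + plain @ w                    # residual block, no normalization
    normed = standardize(normed + normed @ w)    # same block, normalized afterwards

rms_plain = rms(plain)
rms_normed = rms(normed)
print(rms_plain)    # grows by orders of magnitude with depth
print(rms_normed)   # stays close to 1
```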

Variants

  • Pre-norm places the normalization before each sublayer, inside the residual branch, and trains more stably for deep models
  • RMSNorm drops the mean subtraction and divides only by the root mean square of the features, which is cheaper and now common in large language models
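
Both variants are short to write down. In the sketch below, `rms_norm`, `pre_norm_block`, `post_norm_block`, and the stand-in `sublayer` (a fixed linear map, not real attention) are hypothetical names for illustration. The post-norm wiring, which normalizes after the residual addition, is included only for contrast with pre-norm.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Rescale by the root mean square of the features; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def pre_norm_block(x, sublayer, norm):
    """Pre-norm: normalize the input of the sublayer, keep the residual path untouched."""
    return x + sublayer(norm(x))

def post_norm_block(x, sublayer, norm):
    """Post-norm: normalize after adding the residual."""
    return norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(2, 8))      # 2 token vectors with a non-zero mean
gain = np.ones(8)                          # learned gain, initialized to 1 (assumed)
w = rng.normal(size=(8, 8)) * 0.1          # stand-in weights for a sublayer

norm = lambda h: rms_norm(h, gain)
sublayer = lambda h: h @ w

y = rms_norm(x, gain)
print(np.sqrt(np.mean(y ** 2, axis=-1)))   # close to 1 for every vector
print(y.mean(axis=-1))                     # not 0: the mean is left in place

pre_out = pre_norm_block(x, sublayer, norm)
post_out = post_norm_block(x, sublayer, norm)
print(pre_out.shape, post_out.shape)       # both keep the input shape
```

In the pre-norm wiring the unnormalized input `x` passes straight through the addition, so the residual path stays an identity map across the whole stack.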

Key idea

Layer normalization standardizes each example across its features, giving batch-independent stability that suits transformers.

Check yourself

1. Across what does layer normalization compute its statistics?

2. Why is layer norm well suited to transformers compared to batch norm?

3. What does RMSNorm change relative to standard layer norm?