Machine Learning

Residual And Layer Norm Placement

Why pre norm transformers train more easily than post norm ones.

Two stabilizers

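Every transformer sublayer (self-attention or feed-forward) is wrapped with two stabilizers: a residual connection that adds the sublayer's input back to its output, and a layer normalization that rescales each token's activations to zero mean and unit variance across the feature dimension, followed by a learned scale and shift. Where exactly the norm sits has a big effect on training.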

Post norm versus pre norm

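  • Post norm applies normalization after adding the residual, so the output is LayerNorm(x + Sublayer(x)). This is the original transformer design.
  • Pre norm applies normalization only to the sublayer's input, then adds the result to the unnormalized residual, so the output is x + Sublayer(LayerNorm(x)).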
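
The two placements can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward network."""
    return [0.5 * v + 1.0 for v in x]

def post_norm_block(x, sublayer):
    # Add the residual first, then normalize the sum.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

def pre_norm_block(x, sublayer):
    # Normalize only the sublayer's input; the residual x is added untouched.
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

x = [1.0, 2.0, 4.0, 9.0]
post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In this sketch `post_out` is itself normalized, so the original scale of `x` is gone, while `pre_out` is `x` plus an update and keeps the input's scale.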

Why pre norm wins for depth

In pre norm the residual path stays an unnormalized highway from input to output, so gradients can flow straight back through many layers without passing through any normalization. This keeps deep stacks stable and often reduces or removes the need for careful learning rate warmup. In post norm every layer norm sits on the main path, so the backward signal is rescaled at each layer and can vanish or explode as depth grows.
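
A sketch of the gradient argument for a stack of L sublayers. The symbols here are notation introduced for illustration and do not come from the lesson: x_l is the input to layer l, F_l is the sublayer, LN is layer norm, and J_LN is the layer norm Jacobian.

```latex
% Pre norm: the residual is added outside the normalization.
x_{l+1} = x_l + F_l(\mathrm{LN}(x_l))
\quad\Longrightarrow\quad
x_L = x_l + \sum_{k=l}^{L-1} F_k(\mathrm{LN}(x_k))

\frac{\partial x_L}{\partial x_l}
  = I + \sum_{k=l}^{L-1} \frac{\partial F_k(\mathrm{LN}(x_k))}{\partial x_l}

% Post norm: every layer output passes through LN.
x_{l+1} = \mathrm{LN}(z_l), \qquad z_l = x_l + F_l(x_l)

% Factors are ordered from k = L-1 (leftmost) down to k = l (rightmost).
\frac{\partial x_L}{\partial x_l}
  = \prod_{k=L-1}^{l} J_{\mathrm{LN}}(z_k)\left(I + \frac{\partial F_k}{\partial x_k}\right)
```

The pre norm gradient always contains the identity term I, whatever the depth. The post norm gradient is a product of L - l factors, each including a layer norm Jacobian, which is where shrinking or growing signals can compound.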

The residual highway

The residual connection lets each sublayer learn a small correction to its input rather than a full transformation. Combined with pre norm, this is a large part of what makes models with dozens or hundreds of layers practical to train.
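
One way to see the "small correction" view is to run a toy pre norm stack and check that the final output is just the input plus the sum of every layer's correction. This is a sketch under stated assumptions: `make_toy_sublayer` is a hypothetical sublayer that returns a small elementwise update, the depth of 48 is arbitrary, and `layer_norm` has no learned parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def make_toy_sublayer(scale):
    """Hypothetical sublayer: a small, bounded elementwise correction."""
    return lambda h: [scale * math.tanh(v) for v in h]

def run_pre_norm_stack(x0, sublayers):
    """Apply pre norm blocks in sequence and record each layer's correction."""
    x = list(x0)
    corrections = []
    for f in sublayers:
        delta = f(layer_norm(x))               # correction computed from normalized input
        corrections.append(delta)
        x = [a + d for a, d in zip(x, delta)]  # added onto the untouched residual stream
    return x, corrections

x0 = [1.0, -2.0, 0.5, 3.0]
sublayers = [make_toy_sublayer(0.1) for _ in range(48)]
x_final, corrections = run_pre_norm_stack(x0, sublayers)

# The residual stream is the input plus the sum of all corrections.
total = [sum(c[i] for c in corrections) for i in range(len(x0))]
reconstructed = [a + t for a, t in zip(x0, total)]
```

Up to floating point rounding, `reconstructed` matches `x_final`: no layer overwrites the stream, each one only adds to it.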

Key idea

Residual connections give each sublayer a clean highway to add a correction, and placing layer norm before the sublayer keeps gradients flowing so very deep transformers stay stable.

Check yourself

1. Why is pre norm placement preferred for deep transformers?

2. What does the residual connection let each sublayer learn?