Two stabilizers
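Each transformer sublayer (attention or feed-forward) is wrapped in a residual connection, which adds the sublayer's input back to its output, and paired with a layer normalization, which rescales activations to zero mean and unit variance before applying a learned scale and shift. Where exactly the norm sits strongly affects how stably the model trains.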
Post norm versus pre norm
- Post norm applies normalization after the residual is added. This is the design of the original Transformer.
- Pre norm applies normalization to the input before the sublayer, then adds the sublayer's output to the clean, unnormalized residual.
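The two placements can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward network."""
    return [0.5 * v + 1.0 for v in x]

def post_norm_block(x, sublayer):
    # Post norm: add the residual first, then normalize the sum.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

def pre_norm_block(x, sublayer):
    # Pre norm: normalize only the sublayer's input; the residual x is added untouched.
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

x = [1.0, 2.0, 4.0, 9.0]
post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In this sketch the post norm output is always renormalized, so the original scale of `x` is lost at every block. The pre norm output is `x` itself plus an update, so the input passes through unchanged.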
Why pre norm wins for depth
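In pre norm, the residual path stays an unnormalized highway from input to output: no normalization sits between one layer's residual stream and the next, so gradients can flow straight back through many layers. This keeps deep stacks stable and often removes the need for careful learning rate warmup. In post norm, every layer's normalization sits on the residual path itself, so the signal is rescaled at each layer and gradients can vanish or explode as depth grows.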
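One way to see the difference is to write out the per-layer Jacobians. The notation here is assumed for illustration and is not from the text above: x_l is the input to layer l, F_l is the sublayer, LN is layer normalization, and J_LN is the Jacobian of LN.

```latex
% Pre norm: the identity term survives in every layer's Jacobian
x_{l+1} = x_l + F_l\big(\mathrm{LN}(x_l)\big),
\qquad
\frac{\partial x_{l+1}}{\partial x_l}
  = I + \frac{\partial F_l\big(\mathrm{LN}(x_l)\big)}{\partial x_l}

% Post norm: the whole Jacobian, identity included, is multiplied by J_LN
x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big),
\qquad
\frac{\partial x_{l+1}}{\partial x_l}
  = J_{\mathrm{LN}}\big(x_l + F_l(x_l)\big)
    \left( I + \frac{\partial F_l(x_l)}{\partial x_l} \right)
```

Chaining the pre norm Jacobians across layers always leaves a bare identity term, which is the direct gradient path. In the post norm chain, every path from output to input passes through each layer's J_LN factor, so its scaling compounds with depth.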
The residual highway
The residual connection lets each sublayer learn a small correction to its input rather than a full transformation. Combined with pre norm, this is a large part of what makes models with dozens or hundreds of layers trainable in practice.
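The "small correction" view can be checked numerically. The sketch below runs a pre norm stack of 100 layers and confirms that the output is exactly the input plus the sum of every sublayer's correction. The sublayer (`make_sublayer`, a scaled `tanh`) is a hypothetical toy chosen only so that each correction is small, and `layer_norm` again omits the learned scale and shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def make_sublayer(scale):
    """Hypothetical sublayer: an elementwise tanh scaled by `scale`."""
    def sublayer(x):
        return [scale * math.tanh(v) for v in x]
    return sublayer

def pre_norm_stack(x, sublayers):
    """Run a pre norm stack, recording the correction each sublayer adds."""
    corrections = []
    for sublayer in sublayers:
        delta = sublayer(layer_norm(x))        # sublayer sees the normalized input
        corrections.append(delta)
        x = [a + d for a, d in zip(x, delta)]  # residual stream is only ever added to
    return x, corrections

x0 = [1.0, -2.0, 0.5, 3.0]
sublayers = [make_sublayer(0.1) for _ in range(100)]
out, corrections = pre_norm_stack(x0, sublayers)

# Rebuild the output as input + sum of all corrections.
total = [sum(c[i] for c in corrections) for i in range(len(x0))]
reconstructed = [a + t for a, t in zip(x0, total)]
```

Because the residual stream is only ever added to, the input `x0` reaches the output intact however many layers are stacked. Each layer contributes one bounded term to the sum.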
Key idea
Residual connections give each sublayer a clean highway to add a correction, and placing layer norm before the sublayer keeps gradients flowing so very deep transformers stay stable.