Residual And Layer Norm Placement

Two stabilizers

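Every transformer sublayer (attention or feed-forward) is wrapped in two stabilizers: a residual connection, which adds the sublayer's input back to its output, and a layer normalization, which rescales each token's activations to roughly zero mean and unit variance. Where exactly the norm sits strongly affects how stably the model trains, especially as depth grows.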

Post norm versus pre norm

Post norm, the original transformer design, applies normalization after adding the residual: y = LayerNorm(x + Sublayer(x)).
Pre norm applies normalization to the sublayer's input, then adds the result to the unnormalized residual: y = x + Sublayer(LayerNorm(x)).
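The two placements can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Elementwise sum of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]


def post_norm_block(x, sublayer):
    # y = LayerNorm(x + Sublayer(x)): the norm sits on the residual path.
    return layer_norm(add(x, sublayer(x)))


def pre_norm_block(x, sublayer):
    # y = x + Sublayer(LayerNorm(x)): the residual x passes through untouched.
    return add(x, sublayer(layer_norm(x)))


def toy_sublayer(h):
    # Hypothetical sublayer: a fixed elementwise scaling, for illustration only.
    return [0.1 * v for v in h]


x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
```

In this sketch `post` is always a normalized vector, whatever the scale of `x`, while `pre` is the original `x` plus a small correction computed from its normalized copy.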

Why pre norm wins for depth

In pre norm, the residual path is an unnormalized highway from input to output: no layer norm sits on it, so gradients can flow straight back through many layers. This keeps deep stacks stable and often reduces or removes the need for careful learning rate warmup. In post norm, every layer norm sits on the residual path itself, so the backward signal is rescaled at each layer and can vanish or explode as depth grows.
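The gradient argument can be written out as a short derivation. The notation is an assumption introduced here for illustration: x_l is the input to layer l, F_l is that layer's sublayer, LN is layer normalization, J_LN is its Jacobian, and L is the total number of layers.

```latex
% Pre norm: the update is additive, so the stack unrolls into a sum.
x_{l+1} = x_l + F_l(\mathrm{LN}(x_l))
\quad\Longrightarrow\quad
x_L = x_l + \sum_{k=l}^{L-1} F_k(\mathrm{LN}(x_k)),
\qquad
\frac{\partial x_L}{\partial x_l}
  = I + \sum_{k=l}^{L-1} \frac{\partial F_k(\mathrm{LN}(x_k))}{\partial x_l}

% Post norm: each layer applies LN to the sum, so the stack unrolls into a product
% (factors ordered with the largest k on the left).
x_{l+1} = \mathrm{LN}(u_l), \quad u_l = x_l + F_l(x_l)
\quad\Longrightarrow\quad
\frac{\partial x_L}{\partial x_l}
  = \prod_{k=l}^{L-1} J_{\mathrm{LN}}(u_k)\left(I + \frac{\partial F_k(x_k)}{\partial x_k}\right)
```

Under pre norm the identity term I survives regardless of depth, so some gradient always reaches layer l unchanged. Under post norm every factor contains a layer norm Jacobian, so no term bypasses the normalizations.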

The residual highway

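The residual connection lets each sublayer learn a small correction to its input rather than a full transformation: if the sublayer contributes nothing, the block can still pass its input through. Combined with pre norm, this is a large part of what makes models with dozens or hundreds of layers practical to train.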
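A toy experiment makes the "small correction" view concrete. The sketch below assumes an extreme case, a hypothetical `zero_sublayer` that contributes nothing, and stacks 100 blocks of each kind. The layer norm again omits learned parameters.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def zero_sublayer(h):
    # Hypothetical sublayer that adds no correction at all.
    return [0.0 for _ in h]


def pre_norm_block(x):
    # x + Sublayer(LayerNorm(x)): with a zero correction this is the identity.
    return [xi + fi for xi, fi in zip(x, zero_sublayer(layer_norm(x)))]


def post_norm_block(x):
    # LayerNorm(x + Sublayer(x)): the norm still rewrites the residual stream.
    return layer_norm([xi + fi for xi, fi in zip(x, zero_sublayer(x))])


def run_stack(x, depth, block):
    """Apply the same block `depth` times in sequence."""
    for _ in range(depth):
        x = block(x)
    return x


x = [1.0, 2.0, 3.0, 4.0]
pre_out = run_stack(x, 100, pre_norm_block)
post_out = run_stack(x, 100, post_norm_block)
```

With a zero correction the pre norm stack returns its input unchanged at any depth, while the post norm stack returns a normalized vector and loses the input's original mean and scale.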

Key idea

Residual connections give each sublayer a clean highway to add a correction, and placing layer norm before the sublayer keeps gradients flowing so very deep transformers stay stable.

