Residual Connections

The problem they solve

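As networks get deeper, plain stacking can make them harder to train, not easier. Early very deep networks without skip connections often reached higher training error than shallower ones, which points to an optimization problem rather than overfitting: gradients struggled to flow, and stacked nonlinear layers had difficulty even approximating an identity mapping.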

What a residual connection does

A residual connection adds the input of a block directly to its output. The block only has to learn the residual, meaning the change to apply rather than the full transformation.

The output is the input plus the block result.
If the best thing to do is nothing, the block can learn to output zero.
Gradients flow straight through the addition during backpropagation.

This skip path is why architectures with hundreds of layers became trainable.
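
As a minimal sketch of the points above, assume a toy scalar block f(x) = w * x. The names `residual`, `scale_block`, `plain_stack_grad`, and `residual_stack_grad` are illustrative and do not come from any library. The sketch shows that a block outputting zero leaves the input unchanged, and that each residual layer contributes a factor of 1 + w to the gradient instead of w, so the product over many layers does not collapse toward zero.

```python
def residual(block, x):
    """Apply a residual connection: output = input + block(input)."""
    return x + block(x)

def scale_block(w):
    """Toy block (an assumption for illustration): f(x) = w * x."""
    return lambda x: w * x

# A block that outputs zero turns the residual layer into the identity.
identity_out = residual(scale_block(0.0), 3.0)

def plain_stack_grad(w, depth):
    """d(output)/d(input) for `depth` stacked plain layers y = w * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= w          # chain rule: each layer multiplies by w
    return grad

def residual_stack_grad(w, depth):
    """Same derivative, for residual layers y = x + w * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w    # the "1" comes from the skip path
    return grad

depth, w = 100, 0.01
plain = plain_stack_grad(w, depth)        # vanishingly small
skipped = residual_stack_grad(w, depth)   # stays of order one
```

With small weights the plain stack's gradient is effectively zero, while the residual stack's stays of order one. Real blocks are nonlinear and multidimensional, but the additive identity term plays the same role.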

Why it matters for transformers

Every attention and feed-forward sublayer in a transformer is wrapped in a residual connection, usually paired with normalization. The skip path gives both the activations and the gradients a direct route from the first layer to the last, which is essential for the depth that modern models reach.
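
The wrapping described above can be sketched as follows. This assumes the pre-norm layout, where normalization is applied to the sublayer's input; the text does not say which layout is meant, and post-norm variants also exist. The `toy_attention` and `toy_feed_forward` functions are hypothetical placeholders, not real attention or feed-forward networks, and `layer_norm` omits the learned scale and shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Elementwise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def transformer_block(x, attention, feed_forward):
    """Pre-norm layout (an assumption): x + sublayer(norm(x)), applied twice."""
    x = add(x, attention(layer_norm(x)))      # residual around attention
    x = add(x, feed_forward(layer_norm(x)))   # residual around feed-forward
    return x

# Placeholder sublayers standing in for the real ones.
toy_attention = lambda h: [0.1 * v for v in h]
toy_feed_forward = lambda h: [0.0 for _ in h]
zero_sublayer = lambda h: [0.0 for _ in h]

x = [1.0, 2.0, 3.0]
out = transformer_block(x, toy_attention, toy_feed_forward)
unchanged = transformer_block(x, zero_sublayer, zero_sublayer)
```

When both sublayers output zero, the block returns its input untouched, which is the property that lets many such blocks be stacked without degrading the signal.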

A simple intuition

Think of each layer as proposing a small edit to a running representation. The residual stream carries the representation forward, and layers write refinements into it.
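
One way to make this intuition concrete is the sketch below, in which every layer reads the running representation and adds its proposed edit to it. The function `run_residual_stream` and the three toy layers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def run_residual_stream(x, layers):
    """Carry `x` forward; each layer reads the stream and adds its update."""
    stream = list(x)
    writes = []
    for layer in layers:
        update = layer(stream)                # the layer's proposed edit
        writes.append(update)
        stream = [s + u for s, u in zip(stream, update)]
    return stream, writes

# Hypothetical toy layers, each proposing a small edit.
layers = [
    lambda h: [0.5 for _ in h],               # shift every entry up
    lambda h: [-0.25 * v for v in h],         # shrink toward zero
    lambda h: [0.0 for _ in h],               # propose no change
]

x = [1.0, 2.0]
final, writes = run_residual_stream(x, layers)

# The final representation is the input plus the sum of all writes.
total_write = [sum(col) for col in zip(*writes)]
reconstructed = [xi + ti for xi, ti in zip(x, total_write)]
```

Because every update is added rather than substituted, the final representation decomposes into the original input plus the contributions of the individual layers.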

Key idea