The problem they solve
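As networks get deeper, plain stacking can make them harder to train, not easier. Early very deep models often performed worse than shallower ones, even on the training data: gradients weakened as they passed back through many layers, and any layer that had nothing useful to add still had to learn an identity mapping from scratch.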
What a residual connection does
A residual connection adds the input of a block directly to its output. The block only has to learn the residual, meaning the change to apply rather than the full transformation.
- The output is the input plus the block result
- If the best thing to do is nothing, the block can learn to output zero
- Gradients flow straight through the addition during backpropagation
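
The three points above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the two-layer `block`, the helper names, and the all-zero parameters are assumptions made for the example.

```python
def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, vi) for vi in v]


def block(params, x):
    """Hypothetical two-layer block F(x) that preserves the input size."""
    w1, b1, w2, b2 = params
    return linear(w2, b2, relu(linear(w1, b1, x)))


def residual_block(params, x):
    """y = x + F(x): the skip path adds the input back onto the block result."""
    fx = block(params, x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_params(dim):
    """All-zero weights and biases, so F(x) is zero for every input."""
    w1 = [[0.0] * dim for _ in range(dim)]
    w2 = [[0.0] * dim for _ in range(dim)]
    return (w1, [0.0] * dim, w2, [0.0] * dim)


x = [1.0, -2.0, 0.5]
y = residual_block(zero_params(3), x)  # the block outputs zero, so y equals x
```

With every parameter at zero the block contributes nothing and the input passes through unchanged, which is the "do nothing" case from the list. A plain stack with the same parameters would output all zeros instead.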
This skip path is why architectures with hundreds of layers became trainable.
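
To see why depth is less of a problem with a skip path, here is a toy scalar comparison. Everything in it is an assumption chosen for illustration: the layer `0.1 * tanh(x)`, the depth of 50, and the starting input. The layer's slope is always positive and at most 0.1, which makes the contrast easy to see; real networks are not this tidy.

```python
import math


def layer(x):
    """Hypothetical scalar layer with a small slope everywhere."""
    return 0.1 * math.tanh(x)


def layer_grad(x):
    """Derivative of `layer`: 0.1 * (1 - tanh(x)^2), which lies in (0, 0.1]."""
    return 0.1 * (1.0 - math.tanh(x) ** 2)


def input_gradient(x, depth, residual):
    """Chain rule through `depth` layers: d(output) / d(input)."""
    grad = 1.0
    for _ in range(depth):
        local = layer_grad(x)
        if residual:
            grad *= 1.0 + local  # d/dx [x + f(x)] = 1 + f'(x)
            x = x + layer(x)
        else:
            grad *= local        # d/dx [f(x)] = f'(x)
            x = layer(x)
    return grad


plain_grad = input_gradient(0.5, depth=50, residual=False)
skip_grad = input_gradient(0.5, depth=50, residual=True)
```

In the plain stack the gradient is a product of fifty factors no larger than 0.1, so it all but vanishes. With the skip path every factor is one plus the layer's slope, so in this toy setup the product never drops below one. The constant 1 in each factor comes from the addition itself.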
Why it matters for transformers
Every attention and feed-forward sublayer in a transformer is wrapped in a residual connection, usually paired with normalization. The skip path keeps the signal and the gradient strong from the first layer to the last, which is essential for the depth that modern models reach.
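
The wrapping can be sketched as follows. This assumes the "pre-norm" arrangement, where normalization is applied to the sublayer's input; other models normalize after the addition instead. The layer norm here has no learned scale or shift, and `toy_attention` and `toy_feed_forward` are hypothetical stand-ins that act on a single token vector, whereas real attention mixes information across positions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def with_residual(sublayer, x):
    """Pre-norm wrapping (an assumption): x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]


def transformer_layer(x, attention, feed_forward):
    """One layer: two sublayers, each wrapped in its own residual connection."""
    x = with_residual(attention, x)
    x = with_residual(feed_forward, x)
    return x


def toy_attention(h):
    """Stand-in sublayer that proposes no change."""
    return [0.0 for _ in h]


def toy_feed_forward(h):
    """Stand-in sublayer that proposes a scaled copy of its normalized input."""
    return [0.5 * hi for hi in h]


out = transformer_layer([1.0, 2.0, 3.0], toy_attention, toy_feed_forward)
```

Because the sublayer result is added to `x` rather than replacing it, a sublayer that outputs zero leaves the representation untouched, and the original input is still present in `out` after both sublayers.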
A simple intuition
Think of each layer as proposing a small edit to a running representation. The residual stream carries the representation forward, and layers write refinements into it.
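
That picture can be written as a short loop. The function name `run_stream` and the three example layers are hypothetical, chosen only to make the bookkeeping visible.

```python
def run_stream(x, layers):
    """Carry a running representation; each layer reads it and adds an edit."""
    stream = list(x)
    edits = []
    for layer in layers:
        edit = layer(stream)                            # the layer proposes a refinement
        stream = [s + e for s, e in zip(stream, edit)]  # and writes it into the stream
        edits.append(edit)
    return stream, edits


# Hypothetical layers: a scaled copy, a constant shift, and a no-op.
example_layers = [
    lambda s: [0.25 * si for si in s],
    lambda s: [1.0 for _ in s],
    lambda s: [0.0 for _ in s],
]

final, edits = run_stream([1.0, 2.0], example_layers)
```

The final representation is the original input plus the sum of every edit, so no layer ever overwrites what came before; it can only add to it.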
Key idea
Residual connections add a block's input to its output so deep networks learn refinements and gradients flow freely.