Machine Learning

Residual Connections

Skip paths that let very deep networks learn by adding to the input.

The problem they solve

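As networks get deeper, plain stacking can make them harder to train, not easier. Early very deep models without skip paths often performed worse than shallower ones, even on the training data, so the cause was an optimization difficulty rather than overfitting: gradients struggled to flow back through many layers, and any layer that should have passed its input through unchanged still had to learn the identity mapping from scratch.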

What a residual connection does

A residual connection adds the input of a block directly to its output. The block only has to learn the residual, meaning the change to apply rather than the full transformation.

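  • The output is the input plus the block's result: y = x + F(x)
  • If the best thing to do is nothing, the block can learn to output zero, which leaves the input unchanged
  • During backpropagation the addition passes the gradient to the input unchanged, so there is always a direct path around the block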

This skip path is a major reason architectures with hundreds of layers became trainable.
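
The idea can be sketched in a few lines of plain Python. This is only a toy illustration, not how a real framework implements it: the helper names `residual`, `scale_block`, and `stack` are made up for this example, and the "block" F is just a multiplication by a fixed weight. It shows that a block outputting zero leaves the input untouched, and compares how much of a signal survives 20 layers with and without the skip path.

```python
def residual(block):
    """Wrap a block F so the layer computes x + F(x), element by element."""
    def wrapped(x):
        return [xi + fi for xi, fi in zip(x, block(x))]
    return wrapped


def scale_block(weight):
    """Toy stand-in for a learned block (an assumption): F(x) = weight * x."""
    return lambda x: [weight * xi for xi in x]


def stack(layers, x):
    """Apply a list of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


x = [1.0, -2.0, 3.0]

# When the block outputs zero, the residual layer is exactly the identity.
identity_like = residual(scale_block(0.0))
print(identity_like(x))  # [1.0, -2.0, 3.0]

# Each toy block has a small local derivative of 0.01. Because these layers
# are linear, the output for an input of 1.0 equals the derivative of the
# whole stack with respect to its input.
depth, w = 20, 0.01
plain_grad = stack([scale_block(w)] * depth, [1.0])[0]
residual_grad = stack([residual(scale_block(w))] * depth, [1.0])[0]
print(plain_grad)     # vanishingly small: a product of 20 factors of 0.01
print(residual_grad)  # close to 1: each factor is 1 + 0.01
```

In the plain stack the per-layer factors multiply toward zero, while in the residual stack every factor is one plus a small term, so the gradient reaching the first layer stays at a usable size.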

Why it matters for transformers

Every attention and feed-forward sublayer in a transformer is wrapped in a residual connection, usually paired with normalization. The skip path keeps the signal and the gradient strong from the first layer to the last, which is essential for the depth that modern models reach.
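
A minimal sketch of that wrapping, assuming the "pre-norm" arrangement x + sublayer(norm(x)); the original transformer instead normalized after the addition, and both variants are in use. The functions `toy_attention` and `toy_feed_forward` are hypothetical placeholders so the example runs without any library, and the normalization here has no learned scale or shift.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def pre_norm_sublayer(x, sublayer):
    """Pre-norm residual wrapping: x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]


def toy_attention(x):
    # Hypothetical placeholder for attention: nudges every element toward the mean.
    mean = sum(x) / len(x)
    return [0.1 * (mean - xi) for xi in x]


def toy_feed_forward(x):
    # Hypothetical placeholder for the feed-forward sublayer: a scaled ReLU.
    return [0.1 * max(0.0, xi) for xi in x]


def transformer_layer(x):
    """One layer: each sublayer gets its own residual connection."""
    x = pre_norm_sublayer(x, toy_attention)
    x = pre_norm_sublayer(x, toy_feed_forward)
    return x


h = [0.5, -1.0, 2.0, 0.0]
for _ in range(12):  # 12 layers is an arbitrary depth chosen for illustration
    h = transformer_layer(h)
print(h)
```

Because each sublayer only adds to `h`, the input is never overwritten: a sublayer that returns all zeros would leave the representation exactly as it was.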

A simple intuition

Think of each layer as proposing a small edit to a running representation. The residual stream carries the representation forward, and layers write refinements into it.
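
The same picture as a small sketch: each layer reads the running representation and adds an edit to it, so the final result is the original input plus the sum of every edit. The function `run_stream` and the three toy layers are assumptions made for illustration only.

```python
def run_stream(x, layers):
    """Run layers over a residual stream, recording the edit each layer writes."""
    stream = list(x)
    edits = []
    for layer in layers:
        edit = layer(stream)  # the layer reads the current stream
        stream = [s + e for s, e in zip(stream, edit)]  # and adds its edit to it
        edits.append(edit)
    return stream, edits


# Hypothetical toy layers; real ones would be attention or feed-forward blocks.
layers = [
    lambda s: [0.5 * si for si in s],
    lambda s: [-0.25 for _ in s],
    lambda s: [0.0 for _ in s],  # a layer that proposes no change
]

x0 = [1.0, 2.0]
final, edits = run_stream(x0, layers)

# The final representation equals the input plus the sum of all edits.
total_edit = [sum(e[i] for e in edits) for i in range(len(x0))]
print(final)                                     # [1.25, 2.75]
print([a + b for a, b in zip(x0, total_edit)])   # [1.25, 2.75]
```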

Key idea

Residual connections add a block's input to its output so deep networks learn refinements and gradients flow freely.

Check yourself

1. What does a block with a residual connection have to learn?

2. How do residual connections help gradient flow?