The skip path
A residual connection adds a layer's input directly to its output, so the layer only has to learn a residual correction rather than the full mapping.
- The output is the input plus the transformed input: y = x + F(x), where F is the layer's transformation.
- This creates a direct path for the signal to skip the layer.
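A minimal sketch of the skip path in plain Python may make the addition concrete. The names `residual`, `zero_layer`, and `halve` are hypothetical and chosen only for illustration; vectors are plain lists of floats, and the wrapped layer is assumed to return a vector of the same length as its input.

```python
from typing import Callable, List

Vector = List[float]


def residual(layer: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return x + layer(x), the input added back to the layer's output."""
    fx = layer(x)
    # The element-wise addition only makes sense if the shapes agree.
    if len(fx) != len(x):
        raise ValueError("layer output must have the same length as its input")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_layer(x: Vector) -> Vector:
    # Hypothetical layer whose transformation contributes nothing.
    return [0.0 for _ in x]


def halve(x: Vector) -> Vector:
    # Hypothetical layer that produces a small correction of 0.5 * x.
    return [0.5 * xi for xi in x]


if __name__ == "__main__":
    print(residual(zero_layer, [1.0, -2.0]))  # input passes through unchanged
    print(residual(halve, [2.0, 4.0]))        # input plus the correction
```

With `zero_layer` the wrapped block behaves as the identity, which is the "skip" in its purest form; with `halve` the output is the input plus a correction.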
Why it works
- Along the skip path, gradients pass through the addition unchanged, which mitigates vanishing gradients.
- A layer can fall back to the identity simply by driving its transformation toward zero, so adding depth does not, in principle, reduce what the network can represent.
- This made networks with hundreds of layers trainable.
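The gradient argument can be illustrated with a toy calculation. The sketch below assumes a deliberately simplified setting that is not from the source: every layer is a scalar multiplication by the same small weight `w`, so the gradient of the last activation with respect to the first is a product of per-layer factors. Without a skip path each factor is `w`; with one, each factor is `1 + w`, because the derivative of x + w*x with respect to x is 1 + w. The function names are hypothetical.

```python
def plain_chain_grad(w: float, depth: int) -> float:
    """Gradient through `depth` layers of h -> w * h (no skip path)."""
    grad = 1.0
    for _ in range(depth):
        grad *= w  # each layer scales the gradient by w
    return grad


def residual_chain_grad(w: float, depth: int) -> float:
    """Gradient through `depth` layers of h -> h + w * h (with skip path)."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w  # the identity term contributes the constant 1
    return grad


if __name__ == "__main__":
    w, depth = 0.01, 100
    print(plain_chain_grad(w, depth))     # vanishingly small
    print(residual_chain_grad(w, depth))  # stays on the order of 1
```

In this toy setting, 100 layers with `w = 0.01` shrink the plain gradient to roughly 1e-200, while the residual version stays near 2.7. Real networks have nonlinear, matrix-valued layers, so this is only an intuition for why the identity term helps, not a general guarantee.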
Residual connections are central to deep convolutional networks such as ResNets and to the standard transformer block, where they wrap both the attention and feedforward sublayers.
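As a rough sketch of how the two wraps in a transformer block compose, the code below applies one skip addition around an attention sublayer and a second around a feedforward sublayer. It is a structural illustration only: `transformer_block`, `add`, and the stand-in sublayers are hypothetical names, the sublayers are passed in as ordinary callables rather than real attention or feedforward implementations, and the normalization layers that real transformer blocks also include are omitted.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(x: Vector, y: Vector) -> Vector:
    """Element-wise sum used for the skip addition."""
    return [xi + yi for xi, yi in zip(x, y)]


def transformer_block(x: Vector, attention: Sublayer, feedforward: Sublayer) -> Vector:
    """Apply two residual-wrapped sublayers in sequence (normalization omitted)."""
    x = add(x, attention(x))    # first skip: around the attention sublayer
    x = add(x, feedforward(x))  # second skip: around the feedforward sublayer
    return x


if __name__ == "__main__":
    # Stand-in sublayers for illustration only; real ones are learned.
    def fake_attention(x: Vector) -> Vector:
        return [0.0 for _ in x]   # contributes nothing

    def fake_feedforward(x: Vector) -> Vector:
        return [1.0 for _ in x]   # adds a constant correction

    print(transformer_block([1.0, 2.0], fake_attention, fake_feedforward))
```

Because each sublayer only adds to the running value of `x`, the original input always has a direct additive route to the block's output.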
Key idea
Residual connections add the input back to the output, giving gradients a clean path and letting layers learn small corrections so very deep networks stay trainable.