Deeper stopped helping
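Before residual networks, stacking more layers eventually made training error rise, not fall. Because the rise showed up in training error and not only in test error, the problem was optimization, not capacity or overfitting: a deeper network could in principle copy a shallower one and leave its extra layers as identity, yet training failed to find that solution. Gradients struggled to reach early layers.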
The residual block
A skip connection adds the block input directly to its output, so the layers only learn a residual, the difference from identity. If the best mapping is close to identity, the block can simply push its weights toward zero.
- The output is y = F(x) + x: the learned function of the input plus the input itself.
- The shortcut carries gradient straight back, easing the vanishing-gradient problem.
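The two points above can be sketched in a few lines of plain Python. This is a minimal illustration, not the layout of any particular library: the helper names (linear, relu, residual_block, zeros) and the choice of two dense layers for F are assumptions made for the example, and the activation that many designs apply after the addition is left out to keep the identity case visible.

```python
def linear(weights, bias, x):
    """Dense layer on plain lists: returns weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise max(0, a)."""
    return [max(0.0, a) for a in v]


def residual_block(x, w1, b1, w2, b2):
    """Compute y = F(x) + x, with F(x) = W2 * relu(W1 * x + b1) + b2."""
    fx = linear(w2, b2, relu(linear(w1, b1, x)))
    # The skip connection: add the untouched input to the learned residual.
    return [f + xi for f, xi in zip(fx, x)]


def zeros(n):
    """n-by-n matrix of zeros (hypothetical helper for the demo)."""
    return [[0.0] * n for _ in range(n)]


x = [1.0, -2.0, 3.0]

# With every weight and bias at zero, F(x) = 0 and the block is exactly identity.
y_identity = residual_block(x, zeros(3), [0.0] * 3, zeros(3), [0.0] * 3)

# A small nonzero bias in the second layer makes the block a small
# correction on top of identity.
y_nudged = residual_block(x, zeros(3), [0.0] * 3, zeros(3), [0.5, 0.0, 0.0])
```

Here y_identity equals x, and y_nudged differs from x only in the first component, which is the sense in which the layers learn a correction to identity instead of the whole mapping.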
Why it trains
The added path gives gradients a direct route to early layers. Even very deep stacks now optimize because each block can start near a safe identity mapping and only has to learn a small correction.
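A scalar toy model shows why the added path matters. Assume each layer has the same small local derivative d (the value 0.01 and the depth of 50 are hypothetical, chosen only for illustration). In a plain stack the backpropagated gradient is a product of the d terms; with y = x + F(x) each factor becomes 1 + d, because the identity path contributes a derivative of 1. Real networks multiply Jacobian matrices, not scalars, so this is a sketch of the mechanism and not a measurement.

```python
def plain_gradient(local_derivs):
    """Gradient through a plain stack: product of per-layer derivatives."""
    g = 1.0
    for d in local_derivs:
        g *= d
    return g


def residual_gradient(local_derivs):
    """Gradient through a residual stack: each layer contributes 1 + d,
    where the 1 comes from the identity shortcut."""
    g = 1.0
    for d in local_derivs:
        g *= 1.0 + d
    return g


depth = 50
local_derivs = [0.01] * depth  # hypothetical per-layer derivative of F

plain = plain_gradient(local_derivs)        # shrinks toward zero with depth
residual = residual_gradient(local_derivs)  # stays on the order of 1

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {residual:.3f}")
```

In this toy setting the plain product collapses to a vanishingly small number, while the residual product stays close to 1, so the earliest layer still receives a usable signal.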
Dimension matching
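When the channel count or spatial size changes inside a block, the input and output no longer have the same shape and cannot be added directly. The shortcut then uses a 1x1 convolution, strided when the spatial size shrinks, to project the input to the output shape before the addition. This keeps the sum well defined.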
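The shape bookkeeping can be sketched with nested lists standing in for a [channels][height][width] tensor. A 1x1 convolution is a per-pixel linear map across channels, and a stride of 2 keeps every second row and column. The function names, the projection weights, and the all-zero stand-in for the main branch output are assumptions for this example only.

```python
def conv1x1(x, weight, stride=1):
    """1x1 convolution on a [C_in][H][W] nested list.

    weight has shape [C_out][C_in]; stride subsamples rows and columns.
    """
    c_in = len(x)
    rows = range(0, len(x[0]), stride)
    cols = range(0, len(x[0][0]), stride)
    return [[[sum(w_row[c] * x[c][i][j] for c in range(c_in)) for j in cols]
             for i in rows]
            for w_row in weight]


def shape(t):
    """(channels, height, width) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]))


def add(a, b):
    """Element-wise sum; refuses mismatched shapes."""
    assert shape(a) == shape(b), "shortcut and branch shapes differ"
    c, h, w = shape(a)
    return [[[a[k][i][j] + b[k][i][j] for j in range(w)] for i in range(h)]
            for k in range(c)]


# Input: 2 channels, 4x4 spatial grid.
x = [[[float(c * 16 + i * 4 + j) for j in range(4)] for i in range(4)]
     for c in range(2)]

# Stand-in for the main branch F(x): 4 channels, 2x2 grid (all zeros here).
branch_out = [[[0.0] * 2 for _ in range(2)] for _ in range(4)]

# Hypothetical projection weights mapping 2 input channels to 4 output channels.
proj_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

shortcut = conv1x1(x, proj_weight, stride=2)  # shape becomes (4, 2, 2)
y = add(branch_out, shortcut)                 # now the sum is well defined
```

Calling add(branch_out, x) without the projection would fail the shape check, which is the mismatch the 1x1 convolution is there to resolve.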
The lasting effect
Residual design unlocked networks with hundreds of layers and became a default building block far beyond vision.
Key idea
Skip connections add the input to a block's output so the layers learn a residual, giving gradients a direct path and letting networks of hundreds of layers train stably.