Residual Networks
As networks grew deeper, training got harder, not easier. Residual networks, or ResNets, solved this with skip connections that let signals bypass layers.
The degradation problem
Very deep plain networks sometimes had higher error than shallower ones, even on the training data, which rules out overfitting as the explanation. The issue was optimization: gradients struggled to flow through dozens of stacked layers.
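As a rough illustration of how a gradient can fade across stacked layers, the toy sketch below multiplies one scalar slope per layer, as the chain rule would for a chain of scalar layers. The function name `plain_chain_gradient` and the slope value 0.9 are assumptions chosen for illustration; real networks have matrix-valued Jacobians and normalization layers, so this shows only the general tendency, not a model of any particular architecture.

```python
def plain_chain_gradient(per_layer_slope: float, depth: int) -> float:
    """Gradient of the output w.r.t. the input for a chain of scalar layers.

    Each hypothetical layer computes x -> per_layer_slope * x, so the chain
    rule multiplies the same slope once per layer.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer_slope
    return grad


# With a slope slightly below 1, the gradient shrinks geometrically with depth.
shallow_grad = plain_chain_gradient(0.9, depth=10)
deep_grad = plain_chain_gradient(0.9, depth=50)
print(f"10 layers: {shallow_grad:.4f}")
print(f"50 layers: {deep_grad:.6f}")
```

The deeper chain passes back a much smaller gradient than the shallow one, which is the kind of signal loss the paragraph above describes.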
The residual block
A residual block adds the block's input directly to its output. Instead of learning a full transformation, the layers learn a residual, the change to apply on top of the input.
- The skip path passes the input through unchanged.
- The layers learn only the difference from the input.
- If no change helps, the block can easily learn to do nothing.
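The points above can be written as y = x + F(x), where F is the stack of layers inside the block. The following is a minimal sketch in plain Python, assuming a two-layer fully connected branch with a ReLU in between. The helper names (`linear`, `make_residual_branch`, `residual_block`) are hypothetical, and the sketch leaves out details that practical blocks often include, such as normalization or an activation after the addition.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise max(0, a)."""
    return [max(0.0, a) for a in v]


def linear(weights: Matrix, bias: Vector, v: Vector) -> Vector:
    """Compute W v + b, where `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def make_residual_branch(w1: Matrix, b1: Vector,
                         w2: Matrix, b2: Vector) -> Callable[[Vector], Vector]:
    """Build F(x) = W2 relu(W1 x + b1) + b2, the learned residual."""
    def branch(x: Vector) -> Vector:
        return linear(w2, b2, relu(linear(w1, b1, x)))
    return branch


def residual_block(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Return x + F(x): the skip path carries x through unchanged."""
    fx = branch(x)
    return [a + b for a, b in zip(x, fx)]


# With all-zero weights the branch outputs zeros, so the block is the identity.
dim = 3
zero_w = [[0.0] * dim for _ in range(dim)]
zero_b = [0.0] * dim
do_nothing_branch = make_residual_branch(zero_w, zero_b, zero_w, zero_b)

x = [1.0, -2.0, 0.5]
y = residual_block(x, do_nothing_branch)
print(y == x)
```

Setting the branch weights to zero makes the block return its input exactly, which is the "do nothing" case from the last bullet. A plain block would instead have to learn an identity mapping through its weights and nonlinearity.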
Why it works
The skip connection gives gradients a short path back through the network, so even hundred-layer models train. Learning a residual near zero is easy, so adding layers rarely hurts. This let researchers train networks far deeper than before with steady gains in accuracy.
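The short gradient path can be made explicit with the chain rule. The derivation below is a sketch under stated assumptions: the skip path is a pure identity with no activation after the addition, gradients are written as row vectors, and the symbols (x for the block input, y for its output, F for the layers inside the block, \mathcal{L} for the loss, l and L for a shallower and a deeper block index) are notation introduced here rather than taken from the text above.

```latex
% One block: the output is the input plus the learned residual.
\[
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}
  \left( I + \frac{\partial F}{\partial x} \right)
\]

% A stack of blocks from index l up to L: the skips add up to one direct path.
\[
x_L = x_l + \sum_{i=l}^{L-1} F_i(x_i)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x_l}
= \frac{\partial \mathcal{L}}{\partial x_L}
  \left( I + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i) \right)
\]
```

The identity term I carries the gradient from the deeper block back to the shallower one without passing through any weights, so the signal does not depend solely on a long product of per-layer factors. When every F_i is near zero, the expression reduces to the identity, which matches the observation that extra blocks rarely hurt.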
The pattern is now standard and appears in nearly every modern vision and language architecture.
Key idea
Residual networks add the input to the block output so layers learn a residual, which eases gradient flow and lets networks be very deep.