Vanishing and Exploding Gradients Revisited
Because backprop multiplies many local gradients together, the signal can shrink toward zero or blow up toward infinity as it travels through many layers.
The multiplication problem
- If each layer scales the gradient by less than one, the product vanishes after many layers.
- If each layer scales it by more than one, the product explodes.
- Either way, early layers receive a corrupted update.
Symptoms
Vanishing gradients show up as early layers that barely change, so the network learns slowly or not at all in its depths. Exploding gradients show up as loss spikes, NaN values, and unstable training that diverges suddenly.
Modern remedies
The field developed several fixes that work together. Careful weight initialization keeps the scale near one at the start. Nonlinearities like ReLU avoid the saturating flats that crushed older sigmoid networks. Residual connections give gradients a shortcut path. Normalization layers keep activations in a healthy range, and gradient clipping caps explosions. Together these turned very deep networks from impossible to routine.
Key idea
Repeated multiplication makes gradients vanish or explode in deep nets, so we use initialization, residuals, normalization, and clipping to keep the signal stable.