When gradients blow up
The opposite of the vanishing gradient is the exploding gradient. During backpropagation the chain rule multiplies per-layer derivatives together; when those factors are consistently larger than one in magnitude, their product grows roughly exponentially as it travels back through many layers. The weights then receive enormous updates that destabilize training.
- Signs include the loss jumping to very large values or becoming NaN (not a number).
- Causes include deep or recurrent networks and large weights.
- Effects include a single oversized update that wipes out the progress the model has made so far.
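The compounding described above can be illustrated with a toy calculation. This is a minimal sketch that assumes every layer contributes the same scalar derivative, which real networks do not; the function name backprop_gain and the sample factors are illustrative choices, not part of any library.

```python
def backprop_gain(per_layer_derivative, num_layers):
    """Multiply the same per-layer derivative across num_layers layers.

    Simplifying assumption: each layer contributes an identical scalar
    factor, so the result is just per_layer_derivative ** num_layers.
    """
    gain = 1.0
    for _ in range(num_layers):
        gain *= per_layer_derivative
    return gain


# Compare a factor below one, exactly one, and above one over 50 layers.
for factor in (0.9, 1.0, 1.5):
    print(f"factor={factor}: gain over 50 layers = {backprop_gain(factor, 50):.3e}")
```

A factor slightly below one shrinks the signal toward zero (the vanishing case), while a factor of 1.5 produces a gain in the hundreds of millions after only 50 layers.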
Gradient clipping
The standard fix is gradient clipping. Before applying an update, you check the size of the gradient and shrink it if it exceeds a limit.
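- Clip by norm rescales the whole gradient vector when its norm exceeds a threshold, so its length never exceeds that threshold.
- Clip by value clamps each component independently to a fixed range, which can change the gradient's direction.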
Clipping by norm is usually preferred because it preserves the gradient's direction while only limiting its magnitude.
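The difference between the two strategies can be sketched in a few lines. This is an illustrative version that assumes the gradient is a flat Python list of floats; the function names clip_by_norm and clip_by_value and the threshold of 25 are hypothetical, chosen only for this example.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # already within the limit, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]


def clip_by_value(grad, limit):
    """Clamp each component independently to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]


grad = [30.0, 40.0]  # Euclidean norm is 50

print(clip_by_norm(grad, 25.0))   # [15.0, 20.0]
print(clip_by_value(grad, 25.0))  # [25.0, 25.0]
```

In this example the norm-clipped gradient keeps the original 3:4 ratio between components, whereas the value-clipped gradient ends up pointing along the diagonal, a different direction from the original.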
Where it matters most
Recurrent networks unrolled over long sequences are especially prone to exploding gradients, so clipping is a routine part of training them.
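A toy recurrence shows why long sequences make this worse. The sketch below assumes a scalar linear "network" h_t = w * h_(t-1) with a squared-error loss on the final state, which is far simpler than a real recurrent layer; the names bptt_gradient and clipped_step and all numeric settings are hypothetical.

```python
def bptt_gradient(w, h0, target, steps):
    """Gradient of 0.5 * (h_T - target)**2 with respect to w.

    Assumes the scalar linear recurrence h_t = w * h_(t-1), so
    h_T = h0 * w**steps and dh_T/dw = steps * w**(steps - 1) * h0.
    """
    h_final = h0 * w ** steps
    return (h_final - target) * steps * w ** (steps - 1) * h0


def clipped_step(w, grad, lr, max_norm):
    """One gradient-descent step with norm clipping.

    For a single scalar parameter the norm is just the absolute value.
    """
    if abs(grad) > max_norm:
        grad = grad * (max_norm / abs(grad))
    return w - lr * grad


w, h0, target, steps = 1.2, 1.0, 1.0, 100
grad = bptt_gradient(w, h0, target, steps)

raw_w = w - 0.01 * grad                                  # unclipped update
safe_w = clipped_step(w, grad, lr=0.01, max_norm=1.0)    # clipped update

print(f"gradient:        {grad:.3e}")
print(f"unclipped new w: {raw_w:.3e}")
print(f"clipped new w:   {safe_w:.4f}")
```

With a recurrent weight only slightly above one, 100 time steps are enough for the unclipped update to throw the weight far from its starting value, while the clipped update moves it by at most the learning rate times the threshold.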
Key idea
Exploding gradients come from derivatives larger than one in magnitude compounding through deep networks. Clipping by norm caps the gradient's magnitude while keeping its direction, so updates stay stable.