The problem
During training, gradients can occasionally become enormous, a failure known as exploding gradients. A single oversized step can throw the parameters far from any useful region and produce NaNs, especially in recurrent networks and very deep stacks.
What clipping does
Gradient clipping limits the magnitude of the gradient before the optimizer applies the update. Two variants are common:
- Norm clipping: if the gradient's overall norm exceeds a threshold, rescale it down to that norm while keeping its direction.
- Value clipping: clamp each component independently to a fixed range such as [-c, c], which can change the gradient's direction.
Norm clipping is usually preferred because it preserves the update direction.
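The two variants can be sketched in plain Python. This is a minimal illustration, not any library's implementation: the function names clip_by_norm and clip_by_value, the use of the L2 norm, and the example gradient and threshold are all assumptions chosen for the demonstration.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # already within the threshold: leave as-is
    scale = max_norm / norm        # shrink factor, strictly below 1
    return [g * scale for g in grad]

def clip_by_value(grad, limit):
    """Clamp each component to [-limit, limit] independently."""
    return [max(-limit, min(limit, g)) for g in grad]

# Hypothetical gradient with L2 norm 50 and a threshold of 5.
grad = [30.0, 40.0]
by_norm = clip_by_norm(grad, 5.0)    # approximately [3.0, 4.0], same 3:4 direction
by_value = clip_by_value(grad, 5.0)  # [5.0, 5.0], direction changed to 1:1
```

In this example norm clipping keeps the 3:4 ratio between components, while value clipping flattens both components to the same limit and so points the update somewhere else.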
Why it helps
Clipping bounds the worst-case step size, which stabilizes training while leaving gradients below the threshold unchanged. It is a cheap safety net, not a replacement for choosing a sensible learning rate.
It is especially common when training recurrent networks and transformers on long sequences.
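The bound on the worst-case step can be checked with a toy update. The sketch below assumes plain SGD and L2 norm clipping; the helper sgd_step_with_clipping, the learning rate, the threshold, and the gradients are hypothetical values picked for illustration. Under these assumptions no step can be longer than the learning rate times the threshold.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def sgd_step_with_clipping(params, grad, lr, max_norm):
    """One plain SGD step using a norm-clipped gradient."""
    norm = l2_norm(grad)
    # Scale is 1 for gradients within the threshold, below 1 otherwise.
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [p - lr * scale * g for p, g in zip(params, grad)]

params = [1.0, -2.0]
lr, max_norm = 0.1, 1.0
small_grad = [0.3, 0.4]    # norm about 0.5: below the threshold, left untouched
huge_grad = [3e6, -4e6]    # norm about 5e6: an exploding gradient

after_small = sgd_step_with_clipping(params, small_grad, lr, max_norm)
after_huge = sgd_step_with_clipping(params, huge_grad, lr, max_norm)

# Length of each parameter step.
step_small = l2_norm([a - p for a, p in zip(after_small, params)])
step_huge = l2_norm([a - p for a, p in zip(after_huge, params)])
# step_small is about lr * 0.5 = 0.05; step_huge is capped at about lr * max_norm = 0.1
```

Without clipping, the second step would have length on the order of 5e5, which is the kind of single ruinous update the technique is meant to prevent.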
Key idea
Gradient clipping caps the gradient norm before an update so exploding gradients cannot cause a single ruinous step, acting as a cheap safety net that preserves the update direction.