The danger of huge gradients
Sometimes a batch produces an enormous gradient. This can happen with rare tokens, deep recurrent networks, or a batch that happens to contain outlier examples. A single giant update can throw the weights far from any good region, causing the loss to spike or become NaN (not a number).
What clipping does
Gradient clipping limits how large the gradient can be before it is used to update the weights. The most common form is clipping by norm.
- Compute the total norm of all gradients, treating every parameter's gradient as part of one long vector (typically the L2 norm)
- If the norm exceeds a chosen threshold, scale the whole gradient down so its norm equals the threshold
- Otherwise leave it unchanged
This preserves the direction of the update while capping its magnitude.
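The steps above can be sketched in plain Python. This is a minimal illustration, not any particular library's implementation: the function name `clip_by_global_norm`, the list-of-lists gradient layout, and the example values are all assumptions made for the sketch, and it assumes the L2 norm.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    grads: list of lists of floats, one inner list per parameter tensor
           (a hypothetical layout chosen for this sketch).
    Returns (clipped_grads, norm_before_clipping).
    """
    # Treat all gradients as one long vector and take its L2 norm.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))

    # Within the threshold: return an unchanged copy.
    if total_norm <= max_norm:
        return [list(tensor) for tensor in grads], total_norm

    # Over the threshold: one shared scale factor keeps the direction intact.
    scale = max_norm / total_norm
    return [[g * scale for g in tensor] for tensor in grads], total_norm

# Illustrative gradients with global norm sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Because every element is multiplied by the same factor, the ratios between elements are unchanged, which is what preserving the direction means here.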
Clip by value
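A simpler variant clips each individual gradient element to lie within a fixed range. This is easy to implement, but because large elements are cut back while small ones pass through untouched, it can change the update direction. For that reason clipping by norm is usually preferred.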
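A small sketch makes the direction change visible. The helper names `clip_by_value` and `cosine`, the symmetric range of plus or minus `limit`, and the example gradient are assumptions for illustration only.

```python
import math

def clip_by_value(grad, limit):
    """Clamp each gradient element independently to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

def cosine(a, b):
    """Cosine similarity between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One large element and one small element (made-up values).
grad = [10.0, 0.5]
clipped = clip_by_value(grad, limit=1.0)

# Only the first element was cut, so the vector now points elsewhere.
similarity = cosine(grad, clipped)
```

Here the first element is reduced while the second is left alone, so `similarity` comes out below 1.0. Rescaling the whole vector by a single factor, as clipping by norm does, would leave it at 1.0.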
Where it helps most
- Training recurrent networks and transformers
- Runs that show occasional loss spikes
- Large-batch or high-learning-rate settings
A typical threshold for the gradient norm is around 1.0, though the best value depends on the model and is usually tuned. Clipping is a safety net that keeps a rare bad batch from undoing thousands of good steps.
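To see the safety-net effect in numbers, here is a toy single-weight example. The function `sgd_step`, the learning rate, the starting weight, and the outlier gradient of 10,000 are all invented for illustration and are not taken from any real run.

```python
def sgd_step(w, grad, lr, max_norm=None):
    """One plain SGD step on a scalar weight, optionally clipping the gradient first."""
    if max_norm is not None and abs(grad) > max_norm:
        # For a single scalar, the norm is just the absolute value.
        grad = grad * (max_norm / abs(grad))
    return w - lr * grad

w = 0.5            # weight after many good steps (made-up value)
bad_grad = 1.0e4   # gradient from one rare bad batch (made-up value)

unclipped = sgd_step(w, bad_grad, lr=0.1)               # weight jumps by about 1000
clipped = sgd_step(w, bad_grad, lr=0.1, max_norm=1.0)   # weight moves by about 0.1
```

With a threshold of 1.0, the bad batch can move the weight by at most the learning rate, so the progress from earlier steps is largely kept.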
Key idea
Gradient clipping caps the update magnitude so a single huge gradient cannot destabilize the entire run.