Machine Learning

Gradient Clipping

Capping the size of gradients so a single bad batch cannot blow up training.

The danger of huge gradients

Sometimes a single batch produces an enormous gradient. This can happen with rare tokens, deep recurrent networks, or unlucky data. One giant update can throw the weights far from any good region, causing the loss to spike or become NaN (not a number).

What clipping does

Gradient clipping limits how large the gradient can be before it is used to update the weights. The most common form is clipping by norm.

  • Compute the total norm (usually the L2 norm) of all gradients taken together
  • If the norm exceeds a chosen threshold, scale the whole gradient down so its norm equals the threshold
  • Otherwise leave it unchanged

This preserves the direction of the update while capping its magnitude.
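
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function name `clip_by_norm`, the list-of-lists gradient layout, and the example numbers are assumptions made for the example.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm.

    `grads` is a hypothetical layout: one list of floats per parameter tensor.
    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    # Total norm over every element of every gradient, as if they were concatenated.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Below the threshold: leave the gradients unchanged.
        return grads, total_norm
    # Above the threshold: one shared scale factor keeps the direction intact.
    scale = max_norm / total_norm
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Example values chosen so the total norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_norm(grads, max_norm=1.0)
```

Because every element is multiplied by the same factor, the ratios between elements are the same before and after clipping; only the overall length of the gradient changes.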

Clip by value

A simpler variant clips each individual gradient element to lie within a fixed range. This is easy to implement, but because large elements are cut back while small ones are left alone, it can change the update direction, so clipping by norm is usually preferred.
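
A small sketch of clipping by value shows how the direction can shift. The function name `clip_by_value`, the symmetric range, and the example gradient are assumptions chosen for illustration.

```python
def clip_by_value(grad, limit):
    """Clamp each gradient element independently to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

# Hypothetical gradient with one small and one very large element.
grad = [0.5, 10.0]
clipped = clip_by_value(grad, limit=1.0)   # [0.5, 1.0]

# The ratio between the two elements describes the direction of the update.
ratio_before = grad[0] / grad[1]           # 0.05
ratio_after = clipped[0] / clipped[1]      # 0.5
```

Only the large element was reduced, so the ratio between the elements changed and the clipped gradient points in a different direction from the original.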

Where it helps most

  • Training recurrent networks and transformers
  • Runs that show occasional loss spikes
  • Large batch or high learning rate settings

A typical threshold is a maximum norm of around 1.0, though the best value depends on the model and is usually tuned. Clipping is a safety net that keeps a rare bad batch from undoing thousands of good steps.
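
The safety-net effect can be seen in a toy run. The sketch below assumes a one-parameter loss of 0.5 * w**2 (so the gradient is simply w), a hypothetical bad batch at step 50 that reports a gradient of one million, and a threshold of 1.0. All of these values are invented for the illustration.

```python
def run(clip_threshold=None, steps=100, lr=0.1, bad_step=50):
    """Gradient descent on 0.5 * w**2 with one artificially huge gradient."""
    w = 1.0
    for step in range(steps):
        grad = w                      # gradient of 0.5 * w**2
        if step == bad_step:
            grad = 1e6                # hypothetical bad batch
        # For a single scalar parameter, the gradient norm is its absolute value.
        if clip_threshold is not None and abs(grad) > clip_threshold:
            grad *= clip_threshold / abs(grad)
        w -= lr * grad
    return w

w_unclipped = run(clip_threshold=None)
w_clipped = run(clip_threshold=1.0)
```

In the unclipped run the single bad step throws the weight far from the minimum at zero, and it has not recovered by the end of the run. In the clipped run the same step moves the weight by at most lr * threshold, so the run stays close to the minimum.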

Key idea

Gradient clipping caps the update magnitude so a single huge gradient cannot destabilize the entire run.

Check yourself

1. What does clipping by norm preserve while limiting the gradient?

2. Why is clipping by norm usually preferred over clipping by value?