The problem at the start
At the very beginning of training, the weights are random and the gradient estimates are unreliable. Adaptive optimizers like Adam also have noisy moment estimates in the first steps, because their running averages are built from only a handful of gradients. A large learning rate here can push the weights into a bad region or cause the loss to diverge.
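To make the point about Adam concrete, here is a minimal sketch that assumes the standard Adam update with bias correction, applied to a single scalar gradient. The function name `adam_first_update` and the default hyperparameters are illustrative, not taken from any particular library. It shows that the very first update has a magnitude close to the learning rate whatever the size of the gradient, so the first step is a full-sized move based on a single noisy sample.

```python
import math


def adam_first_update(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first parameter update for one scalar gradient.

    Hypothetical helper for illustration; assumes the moment estimates
    start at zero and that bias correction is applied.
    """
    m = (1 - beta1) * grad        # first-moment estimate after one step
    v = (1 - beta2) * grad ** 2   # second-moment estimate after one step
    m_hat = m / (1 - beta1)       # bias-corrected first moment at t = 1
    v_hat = v / (1 - beta2)       # bias-corrected second moment at t = 1
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients eight orders of magnitude apart give nearly the same step size.
for g in (1e-4, 1.0, 1e4):
    print(g, adam_first_update(g))
```

Under these assumptions the step size is set almost entirely by `lr`, which is one way to see why a smaller rate in the first steps limits the damage a bad early gradient can do.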
What warmup does
Warmup starts the learning rate at a small value and increases it gradually, typically over the first few hundred to few thousand steps, until it reaches the target rate.
- A linear warmup raises the rate in equal increments
- After warmup the schedule usually decays, often with a cosine curve
- The warmup length is a tunable number of steps
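One common way to write such a schedule is a linear warmup followed by cosine decay. The symbols here are assumptions introduced for illustration: $t$ is the step index, $T_w$ the number of warmup steps, $T$ the total number of steps, $\eta_0$ the small starting rate, $\eta_{\max}$ the target rate, and $\eta_{\min}$ the final rate.

```latex
\eta(t) =
\begin{cases}
  \eta_0 + (\eta_{\max} - \eta_0)\,\dfrac{t}{T_w},
    & 0 \le t < T_w, \\[2ex]
  \eta_{\min} + \dfrac{1}{2}\,(\eta_{\max} - \eta_{\min})
    \left(1 + \cos\!\left(\pi\,\dfrac{t - T_w}{T - T_w}\right)\right),
    & T_w \le t \le T.
\end{cases}
```

The two pieces meet at $t = T_w$, where the cosine term equals 1 and the rate is exactly $\eta_{\max}$, so the schedule has no jump at the end of warmup.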
Why it matters for big models
Large transformers and large batch sizes are especially sensitive. Without warmup, the early updates can destabilize layer normalization statistics and the attention weights, leading to loss spikes from which the run may never recover.
A typical recipe
- Warm up linearly for a few thousand steps
- Reach the target rate, then either hold it for a while or start decaying right away
- Decay slowly toward zero for the rest of training
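The recipe above can be sketched as a single function of the step index. This is a minimal sketch, not a reference implementation: the name `lr_at`, its parameters, and every default value are hypothetical, and cosine decay is just one possible choice for the final phase.

```python
import math


def lr_at(step, peak_lr=3e-4, start_lr=1e-6, warmup_steps=2000,
          hold_steps=0, total_steps=100_000, final_lr=0.0):
    """Learning rate at `step` for a warmup / hold / decay schedule.

    All names and defaults are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear warmup: equal increments from start_lr up to peak_lr.
        return start_lr + (peak_lr - start_lr) * step / warmup_steps

    decay_start = warmup_steps + hold_steps
    if step < decay_start:
        # Optional hold phase at the target rate.
        return peak_lr

    # Cosine decay from peak_lr to final_lr over the remaining steps.
    decay_steps = max(1, total_steps - decay_start)
    progress = min(1.0, (step - decay_start) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


# Print the rate at a few points along the schedule.
for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, lr_at(s))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each update. Setting `hold_steps` to zero gives the common variant that begins decaying as soon as the peak is reached.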
Warmup is cheap insurance. It costs a little time early on and greatly improves the odds of a stable run.
Key idea
Warmup ramps the learning rate up from a small value so unstable early gradients do not derail training.