Warmup and Cosine Decay
Modern training of large models often shapes the learning rate as a warmup followed by a cosine decay, a combination that stabilizes the start and polishes the finish.
The warmup phase
- Begin with a near-zero rate and ramp it up, typically linearly, over the first steps of training.
- This avoids huge early updates while the optimizer's moment estimates and gradient statistics are still unreliable.
- Adaptive optimizers such as Adam especially benefit because their running estimates start out noisy.
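The ramp described above can be sketched in a few lines. This is a minimal illustration that assumes a linear ramp; the function name `warmup_lr` and the parameters `peak_lr` and `warmup_steps` are hypothetical names chosen here, not taken from any particular library.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup (an assumed ramp shape): scale the rate up to peak_lr.

    `step` is the 0-indexed update count. Using (step + 1) keeps the very
    first update from having a rate of exactly zero.
    """
    if warmup_steps <= 0:
        # No warmup requested: use the full rate from the start.
        return peak_lr
    # Fraction of the ramp completed, capped at 1 once warmup is over.
    fraction = min(1.0, (step + 1) / warmup_steps)
    return peak_lr * fraction


# Example: a 100-step ramp toward a peak rate of 1e-3.
early = warmup_lr(0, peak_lr=1e-3, warmup_steps=100)    # tiny first-step rate
late = warmup_lr(99, peak_lr=1e-3, warmup_steps=100)    # full rate reached
```

A linear ramp is only one option; the important property is that the first updates use a small fraction of the peak rate.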
The cosine phase
After warmup the rate follows the smooth shape of a half cosine, declining from its peak toward near zero by the end of training. The decline is slow at first, faster in the middle, then slow again, which lets the model settle into a minimum without abrupt jumps.
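One common way to write the half-cosine shape is given below as a sketch. The symbols are assumptions introduced here: t counts steps since warmup ended, T is the length of the decay phase, and \eta_{\max} and \eta_{\min} are the peak and final rates.

```latex
\eta(t) = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})
          \left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right),
\qquad 0 \le t \le T

\frac{d\eta}{dt} = -\frac{\pi}{2T}\,(\eta_{\max} - \eta_{\min})
                   \sin\!\left(\frac{\pi t}{T}\right)
```

The derivative vanishes at t = 0 and t = T and is largest in magnitude at t = T/2, which is exactly the slow, fast, slow pattern of the decline.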
Why the pair works
Warmup protects the fragile early phase, when a full rate could cause training to diverge, and cosine decay provides the smooth annealing that often helps generalization. Together they have become a near-default schedule for training transformers, where they often outperform a fixed rate with little extra tuning.
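Putting the two phases together, a full schedule might look like the sketch below. It assumes a linear warmup and a half-cosine decay down to a floor `min_lr`; the function name `warmup_cosine_lr` and all parameter names are hypothetical, and real frameworks ship their own schedulers with different signatures.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a 0-indexed `step` (illustrative sketch).

    Phase 1: linear ramp from peak_lr / warmup_steps up to peak_lr.
    Phase 2: half-cosine decay from peak_lr down to min_lr.
    """
    if step < warmup_steps:
        # Warmup: small early rates protect the noisy start.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half cosine: the factor goes from 1 at progress=0 to 0 at progress=1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Example: 1000 training steps, the first 100 spent on warmup.
schedule = [warmup_cosine_lr(s, 1000, 100, peak_lr=3e-4) for s in range(1000)]
```

In a training loop, the value for the current step would be assigned to the optimizer's learning rate before each update. The warmup length and the peak rate remain the main settings to tune.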
Key idea
Warmup ramps the rate up to protect the noisy start, and cosine decay glides it down for a smooth, generalizing finish.