Machine Learning

Warmup and Cosine Decay

Ramping the learning rate up then gliding it smoothly down.

Modern training of large models often shapes the learning rate as a warmup followed by a cosine decay, a combination that stabilizes the start and polishes the finish.

The warmup phase

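  • Begin with a near-zero rate and ramp it up, typically linearly, to the peak value over the first steps of training.
  • This avoids huge early updates while the optimizer's moment estimates and other running statistics are still unreliable.
  • Adaptive optimizers such as Adam especially benefit, because their per-parameter estimates start out noisy.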
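
The ramp described above can be sketched in a few lines. This is a minimal illustration that assumes a linear ramp; the function name linear_warmup and the parameters peak_lr and warmup_steps are hypothetical and chosen only for this example.

```python
def linear_warmup(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Learning rate for a linear warmup (assumed shape; other ramps exist).

    step is a zero-based training step. The rate rises from a small
    nonzero value to peak_lr and then stays at peak_lr.
    """
    if warmup_steps <= 0:
        # No warmup requested: use the peak rate from the first step.
        return peak_lr
    # (step + 1) keeps the very first update small but not exactly zero.
    fraction = min(1.0, (step + 1) / warmup_steps)
    return peak_lr * fraction


if __name__ == "__main__":
    for s in (0, 249, 499, 999, 5000):
        print(s, linear_warmup(s, peak_lr=1e-3, warmup_steps=1000))
```

With these illustrative settings, the first update uses one thousandth of the peak rate, and the full rate is reached only at the end of the warmup window.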

The cosine phase

After warmup, the rate follows the smooth shape of a half cosine, declining from its peak to near zero by the end of training. The decline is slow at first, fastest in the middle, and slow again at the end, which lets the model settle into a minimum without abrupt jumps.
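
One common way to write this shape is lr(t) = min_lr + 0.5 * (peak_lr - min_lr) * (1 + cos(pi * t / T)), where t counts steps since the end of warmup and T is the length of the decay phase. The sketch below implements that formula; the names cosine_decay, decay_steps, and min_lr are assumptions made for this example, not part of the lesson.

```python
import math


def cosine_decay(step: int, peak_lr: float, decay_steps: int,
                 min_lr: float = 0.0) -> float:
    """Half-cosine decay from peak_lr (at step 0) to min_lr (at decay_steps).

    step is counted from the start of the decay phase, not from the
    start of training.
    """
    # Progress through the decay phase, clamped to the range [0, 1].
    progress = min(max(step / decay_steps, 0.0), 1.0)
    # cos goes from 1 to -1 as progress goes from 0 to 1,
    # so this factor goes smoothly from 1 down to 0.
    factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * factor


if __name__ == "__main__":
    for s in (0, 10, 50, 90, 100):
        print(s, round(cosine_decay(s, peak_lr=1.0, decay_steps=100), 4))
```

Comparing equal-width windows shows the slow, fast, slow pattern: the rate drops far less over the first tenth of the decay than over the tenth centred on the midpoint.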

Why the pair works

Warmup protects the fragile early phase, when a full rate could make training diverge, and cosine decay provides the smooth annealing that tends to help generalization. Together they have become a near-default schedule for training transformers, where they usually outperform a fixed rate with little extra tuning.
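
Putting the two phases together, a full schedule might look like the sketch below. It assumes a linear warmup followed by a half-cosine decay over the remaining steps; the function name warmup_cosine_lr and all parameter names and values are hypothetical, picked only to make the example concrete.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int,
                     peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at a zero-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from a small value up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay phase: measure progress over the steps that remain after warmup.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * factor


if __name__ == "__main__":
    # Illustrative settings, not recommendations.
    for s in (0, 50, 99, 100, 550, 1000):
        lr = warmup_cosine_lr(s, total_steps=1000, warmup_steps=100,
                              peak_lr=3e-4)
        print(f"step {s:4d}  lr {lr:.2e}")
```

In a training loop, the returned value would be written to the optimizer's learning rate before each update. Most deep learning frameworks ship their own schedulers for this pattern, which are usually preferable to a hand-rolled function.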

Key idea

Warmup ramps the rate up to protect the noisy start, and cosine decay glides it down for a smooth, generalizing finish.

Check yourself

1. What does the warmup phase protect against?

2. What shape does the rate follow after warmup in cosine decay?