Machine Learning

The Warmup And Cosine Schedule

Ramp the learning rate up, then glide it down along a cosine curve.

Why schedule the rate

A single fixed learning rate is rarely ideal for the whole of training. A schedule changes the rate over time: small at first to start safely, high in the middle to learn fast, then lower again to settle gently.

Warmup

Warmup starts the learning rate at a small value and increases it, often linearly, over the first steps of training until it reaches its peak.

  • Early gradients can be noisy and erratic.
  • A small initial rate avoids destabilizing the freshly initialized model.
  • It is especially helpful when training with large batch sizes and when training transformers.
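
As a rough illustration of the ramp described above, here is a minimal sketch of a linear warmup. The function name `warmup_lr` and the parameters `peak_lr` and `warmup_steps` are assumptions made for this example, and a linear ramp is only one common choice of warmup shape.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate from a small value up to peak_lr.

    Hypothetical helper: names and the linear shape are illustrative choices.
    """
    # With no warmup configured, or once warmup is over, use the peak rate.
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    # Use step + 1 so the very first update has a small nonzero rate.
    return peak_lr * (step + 1) / warmup_steps


# Print the rate at a few steps of a 10-step warmup.
for step in (0, 4, 9, 20):
    print(step, warmup_lr(step, peak_lr=1e-3, warmup_steps=10))
```

The rate grows in equal increments and reaches `peak_lr` on the last warmup step, after which this helper simply holds it there.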

Cosine decay

After warmup, a cosine schedule smoothly decreases the rate along half a period of a cosine curve, from the peak rate down to a small final value.

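  • The cosine curve is flat near its start, so the rate stays close to its peak for a while and the model makes fast progress.
  • The rate then decays gradually, so update steps shrink as the model approaches a minimum.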
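
One common way to write this decay down is the formula below. The symbols are assumptions introduced here, not notation from the lesson: $t$ is the current step, $T_w$ the number of warmup steps, $T$ the total number of steps, and $\eta_{\max}$ and $\eta_{\min}$ the peak and final learning rates.

```latex
\eta(t) = \eta_{\min}
  + \frac{1}{2}\,(\eta_{\max} - \eta_{\min})
    \left(1 + \cos\!\left(\pi\,\frac{t - T_w}{T - T_w}\right)\right),
\qquad T_w \le t \le T
```

At $t = T_w$ the cosine term equals 1, so $\eta = \eta_{\max}$. At $t = T$ it equals $-1$, so $\eta = \eta_{\min}$. Halfway through the decay the rate sits exactly midway between the two.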

Warmup followed by cosine decay is a robust default for modern deep learning and often outperforms a constant rate.
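
To show how the two phases fit together, here is a minimal sketch of the full schedule as one function. The name `warmup_cosine_lr` and all of its parameters are hypothetical, and the linear warmup shape is an assumption. Deep learning libraries ship their own schedulers, which may differ in details such as the first-step value.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay.

    Hypothetical helper for illustration; not taken from any library.
    """
    # Phase 1: linear ramp up to peak_lr over the first warmup_steps steps.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Phase 2: half a cosine period from peak_lr down to min_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)  # in [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a 1000-step run with 100 warmup steps.
for step in (0, 50, 100, 550, 1000):
    lr = warmup_cosine_lr(step, total_steps=1000, warmup_steps=100, peak_lr=3e-4)
    print(f"step {step:4d}  lr {lr:.6f}")
```

In a training loop, a function like this would be called once per step and its result written into the optimizer before the update.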

Key idea

Warmup ramps the learning rate up to stabilize early training, then a cosine schedule glides it back down so the model makes fast progress before settling gently near a minimum.

Check yourself

1. What does learning rate warmup do?

2. What shape does a cosine schedule follow after warmup?

3. Why is warmup especially helpful early in training?