The Warmup And Cosine Schedule

Why schedule the rate

A fixed learning rate is rarely ideal across all of training. A schedule changes the rate over time to start safely, learn fast, then settle gently.

Warmup

Warmup starts the learning rate small and increases it over the first steps.

Early gradients can be noisy and erratic.
A small initial rate keeps those gradients from destabilizing the freshly initialized model.
Warmup is especially helpful for large-batch training and for transformers.
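
As a rough illustration of the ramp described above, here is a minimal sketch of a linear warmup. The linear shape, the function name `linear_warmup`, and the parameters `warmup_steps` and `peak_lr` are assumptions made for this example; other ramp shapes are also used in practice.

```python
def linear_warmup(step: int, warmup_steps: int, peak_lr: float) -> float:
    """Learning rate at `step` (0-indexed) under an assumed linear ramp."""
    if warmup_steps <= 0:
        # No warmup requested: use the peak rate from the start.
        return peak_lr
    # Use (step + 1) so the very first step gets a small nonzero rate.
    # After warmup_steps steps the rate is held at peak_lr.
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


if __name__ == "__main__":
    # Print the ramp for a hypothetical 5-step warmup to a peak of 1e-3.
    for s in range(7):
        print(s, linear_warmup(s, warmup_steps=5, peak_lr=1e-3))
```

The rate grows by the same amount each step and stops growing once it reaches `peak_lr`.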

Cosine decay

After warmup, a cosine schedule smoothly decreases the rate following a half cosine curve, easing toward a small final value.

The cosine curve is flat near its start, so the rate stays close to its peak early on and the model makes fast progress.
It then decays gradually so steps shrink as training approaches a minimum.
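
The half cosine curve can be written down directly. The sketch below is one common way to do it, not a reference implementation; `decay_steps`, `peak_lr`, and `final_lr` are hypothetical parameter names chosen for this example.

```python
import math


def cosine_decay(step: int, decay_steps: int, peak_lr: float,
                 final_lr: float = 0.0) -> float:
    """Learning rate `step` steps into an assumed half-cosine decay."""
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(max(step / decay_steps, 0.0), 1.0)
    # Half cosine: equals 1 at progress 0 and falls smoothly to 0 at progress 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Interpolate between the peak rate and the small final value.
    return final_lr + (peak_lr - final_lr) * cosine


if __name__ == "__main__":
    for s in (0, 25, 50, 75, 100):
        print(s, cosine_decay(s, decay_steps=100, peak_lr=1e-3, final_lr=1e-5))
```

Because the cosine is flat at both ends, the rate changes slowly near the peak and near the final value, and fastest in the middle of the decay.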

Warmup followed by cosine decay is a robust default for modern deep learning, and it often outperforms a constant rate.
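
To show how the two phases fit together, here is a small sketch of a single function that returns the rate for any step. It assumes a linear warmup and a decay that runs over the remaining steps; the name `warmup_cosine_lr` and all of its parameters are illustrative, not taken from any particular library.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int,
                     peak_lr: float, final_lr: float = 0.0) -> float:
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Decay phase: progress runs from 0 (end of warmup) to 1 (end of training).
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


if __name__ == "__main__":
    # Hypothetical run: 1000 steps, 100 of them warmup.
    for s in (0, 50, 99, 100, 550, 1000):
        print(s, warmup_cosine_lr(s, total_steps=1000, warmup_steps=100,
                                  peak_lr=3e-4, final_lr=3e-5))
```

In a training loop, a function like this would be called once per step and its result assigned to the optimizer's learning rate before the update.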

Key idea

Warmup ramps the learning rate up to stabilize early training, then a cosine schedule glides it back down so the model makes fast progress before settling gently near a minimum.
