Why schedule the rate
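A single fixed learning rate is rarely ideal for an entire training run: a rate small enough to be safe at the start is slow later on, and a rate large enough for fast progress keeps the model from settling at the end. A schedule changes the rate over time so that training starts safely, learns fast, then settles gently.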
Warmup
Warmup starts the learning rate small and increases it, typically linearly, to its peak value over the first steps of training.
- Early gradients can be noisy and erratic.
- A small initial rate avoids destabilizing the freshly initialized model.
- Warmup is especially helpful when training with large batches and when training transformers.
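A minimal sketch of linear warmup, to make the ramp concrete. The function name `warmup_lr`, its parameters, and the example values (peak rate 1e-3, 5 warmup steps) are illustrative assumptions, not taken from any particular library. Linear is only one common ramp shape.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate up to peak_lr (hypothetical helper)."""
    # No warmup requested, or warmup already finished: use the peak rate.
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    # step + 1 so the very first step gets a small nonzero rate.
    return peak_lr * (step + 1) / warmup_steps


# Rates for the first 7 steps with a 5-step warmup (assumed values).
rates = [warmup_lr(s, peak_lr=1e-3, warmup_steps=5) for s in range(7)]
```

Here the rate climbs in equal increments for five steps and then holds at the peak.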
Cosine decay
After warmup, a cosine schedule smoothly decreases the rate along half a period of a cosine curve, from the peak rate down to a small final value.
- The curve is flat near its start, so the rate stays high long enough to make fast progress.
- It then decays gradually, so steps shrink as training approaches a minimum.
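The decay curve can be sketched as follows. This is an illustrative version, assuming the decay runs over `total_steps` steps from `peak_lr` down to `min_lr`. All three names are hypothetical, and `total_steps` is assumed to be positive.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Half-cosine decay from peak_lr to min_lr (hypothetical helper)."""
    # Fraction of the decay completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Goes from 1 at progress 0 down to 0 at progress 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

Halfway through the decay, the rate sits midway between `peak_lr` and `min_lr`. The change per step is smallest at the two ends of the curve.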
Warmup followed by cosine decay is a robust default for modern deep learning and often outperforms a constant rate.
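One way to combine the two phases into a single schedule is sketched below. It assumes a linear warmup and a decay that spans the remaining steps. The function `lr_at`, its parameters, and the example values are hypothetical choices for illustration. Deep learning frameworks ship their own schedulers, which may differ in detail.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup, then half-cosine decay (hypothetical helper)."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay phase: progress runs from 0 to 1 over the remaining steps.
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: 100 total steps, 10 of them warmup (assumed values).
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-3) for s in range(101)]
```

In a training loop, the value for the current step would be assigned to the optimizer's learning rate before each update.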
Key idea
Warmup ramps the learning rate up to stabilize early training, then a cosine schedule glides it back down so the model makes fast progress before settling gently near a minimum.