The Learning Rate Schedule
The learning rate sets how big each update step is. A schedule changes that step size over the course of training instead of holding it fixed, which usually trains faster and reaches a lower final loss.
The tradeoff
- A large rate moves quickly but can overshoot and bounce.
- A small rate is stable but slow, and may stall before it reaches a good solution.
- No single value is ideal for the whole run.
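The tradeoff can be seen on a toy problem. The sketch below is an illustration only: it runs plain gradient descent on f(x) = x**2, and the helper name `descend` and the two rates (0.9 and 0.01) are assumptions chosen for the example.

```python
def descend(lr, steps=20, x0=1.0):
    """Run plain gradient descent on f(x) = x**2 and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        grad = 2.0 * xs[-1]            # derivative of x**2
        xs.append(xs[-1] - lr * grad)  # standard update: x <- x - lr * grad
    return xs

large = descend(lr=0.9)   # each step multiplies x by (1 - 2*0.9) = -0.8
small = descend(lr=0.01)  # each step multiplies x by (1 - 2*0.01) = 0.98

print("large rate:", [round(x, 3) for x in large[:5]])
print("small rate:", [round(x, 3) for x in small[:5]])
```

With the large rate the iterate jumps past the minimum and flips sign on every step. With the small rate it never overshoots, but after 20 steps it is still about two thirds of the starting distance from the minimum.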
Common schedules
A typical plan starts with a relatively large rate to make fast early progress, then decays it so the model can settle gently into a minimum. Step decay drops the rate by a fixed factor at set milestones. Exponential decay multiplies it by a constant factor at every step, so it shrinks smoothly. Many modern runs combine a short warmup, which ramps the rate up from near zero, with a slow cosine decline.
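The three schedules named above can be written as small functions of the step index. This is a minimal sketch, not a reference implementation: the function names and every default value (`base_lr`, `drop`, `every`, `gamma`, `warmup_steps`, `total_steps`, `min_lr`) are assumptions picked for illustration.

```python
import math

def step_decay(step, base_lr=0.1, drop=0.5, every=30):
    """Multiply the rate by `drop` once every `every` steps."""
    return base_lr * drop ** (step // every)

def exponential_decay(step, base_lr=0.1, gamma=0.99):
    """Multiply the rate by `gamma` at every step."""
    return base_lr * gamma ** step

def warmup_cosine(step, base_lr=0.1, warmup_steps=10, total_steps=100, min_lr=0.0):
    """Ramp linearly up to base_lr, then follow half a cosine down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1 past the end.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 10, 30, 60, 100):
    print(s, step_decay(s), exponential_decay(s), warmup_cosine(s))
```

Each function is stateless, so a training loop can call it with the current step and pass the result to whatever optimizer it uses.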
Why decay helps
Early in training the parameters are far from any good region, so big steps pay off. Later the model is near a minimum, where big steps would just rattle around it. Shrinking the rate over time gives a single run both the early speed and the late precision.
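A toy experiment can make the rattling visible. The sketch below assumes the gradient is noisy, as it is with minibatches, and models that noise with a deterministic +1/-1 perturbation so the result is reproducible. The objective f(x) = x**2 / 2, the helpers `run` and `late_spread`, and the rates are all hypothetical choices for the illustration.

```python
def run(schedule, steps=200, x0=5.0):
    """Minimise f(x) = x**2 / 2 from perturbed gradients; return the iterates."""
    x, xs = x0, [x0]
    for t in range(steps):
        noise = 1.0 if t % 2 == 0 else -1.0  # toy stand-in for minibatch noise
        grad = x + noise                     # the true gradient is x
        x -= schedule(t) * grad
        xs.append(x)
    return xs

def late_spread(xs, window=20):
    """Range covered by the last `window` iterates: how much the run rattles."""
    tail = xs[-window:]
    return max(tail) - min(tail)

fixed = run(lambda t: 0.5)                # large rate held for the whole run
decayed = run(lambda t: 0.5 * 0.95 ** t)  # same starting rate, exponential decay

print("fixed rate, late spread:  ", late_spread(fixed))
print("decayed rate, late spread:", late_spread(decayed))
```

Both runs take the same large first steps. The fixed-rate run keeps bouncing between roughly -1/3 and +1/3 around the minimum at 0, while the decayed run ends with a much smaller spread in its last iterates.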
Key idea
A learning rate schedule starts large for speed and decays for precision, combining fast progress with a gentle landing.