The Learning Rate Schedule

The learning rate sets the size of each parameter update. A schedule varies that rate over the course of training instead of holding it fixed, which usually speeds up convergence and reaches a lower final loss.

The tradeoff

A large rate moves quickly but can overshoot the minimum and oscillate around it, or even diverge.
A small rate is stable but slow, and may stall before reaching a good solution.
No single value is ideal for the whole run.
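
The tradeoff can be seen on a toy problem. The sketch below is only an illustration, assuming plain gradient descent on f(x) = x**2; the function name run_gd and the specific rates are hypothetical choices, not taken from any particular library or training setup.

```python
def run_gd(lr, steps=20, x0=1.0):
    """Plain gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # each step multiplies x by (1 - 2 * lr)
    return x

# Illustrative rates: too large, large, well chosen, too small.
for lr in (1.1, 0.9, 0.4, 0.01):
    print(f"lr={lr}: x after 20 steps = {run_gd(lr):.4g}")
```

On this function every step multiplies x by (1 - 2 * lr). A rate of 1.1 gives a factor of -1.2, so x flips sign and grows; 0.9 gives -0.8, so x overshoots on every step but still shrinks; 0.01 gives 0.98, which is stable but has barely moved after 20 steps.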

Common schedules

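A typical plan starts with a relatively large rate to make fast early progress, then decays it so the model can settle gently into a minimum. Step decay cuts the rate by a fixed factor at preset milestones. Exponential decay multiplies it by a constant factor slightly below one at every step or epoch, so it shrinks smoothly. Many modern runs combine a short warmup, which ramps the rate up from a small value over the first steps, with a slow cosine decline.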
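
The schedules named above can each be written as a small function of the step count. This is a minimal sketch in plain Python; the function names, default milestones, and decay factors are assumptions chosen for illustration, and real frameworks expose their own scheduler APIs with different signatures.

```python
import math

def step_decay(step, base_lr, drop=0.1, milestones=(30, 60)):
    """Multiply the rate by `drop` once for every milestone already reached."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * drop ** passed

def exponential_decay(step, base_lr, gamma=0.95):
    """Multiply the rate by `gamma` at every step."""
    return base_lr * gamma ** step

def warmup_cosine(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to `base_lr`, then a cosine decline to `min_lr`."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Print a few sample points from each schedule.
for s in (0, 10, 30, 60, 100):
    print(s, step_decay(s, 0.1), exponential_decay(s, 0.1),
          warmup_cosine(s, 0.1, warmup_steps=10, total_steps=100))
```

In a training loop, a function like one of these would be called once per step (or once per epoch) and its result handed to the optimizer as the current rate.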

Why decay helps

Early in training the parameters are far from any good region, so big steps pay off. Later the model is near a minimum, where big steps would just rattle around it. Shrinking the rate lets early speed and late precision both happen in one run.
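
The rattling effect can be reproduced on a toy problem. The sketch below assumes a one-dimensional loss f(x) = x**2 / 2 whose gradient is corrupted by unit Gaussian noise, a stand-in for minibatch sampling noise. The function noisy_descent, the rates, and the decay factor are all hypothetical choices made for illustration.

```python
import random

def noisy_descent(lr_at, steps=500, x0=5.0, seed=0):
    """Minimize f(x) = x**2 / 2 using the gradient x plus unit Gaussian noise.

    `lr_at` maps a step index to the learning rate for that step.
    Returns the mean |x| over the last 100 steps, a rough measure of how
    tightly the iterate has settled around the minimum at x = 0.
    """
    rng = random.Random(seed)
    x = x0
    tail = []
    for t in range(steps):
        grad = x + rng.gauss(0.0, 1.0)  # assumed noise model, not real minibatches
        x -= lr_at(t) * grad
        if t >= steps - 100:
            tail.append(abs(x))
    return sum(tail) / len(tail)

fixed = noisy_descent(lambda t: 0.5)                # rate held constant
decayed = noisy_descent(lambda t: 0.5 * 0.99 ** t)  # same start, exponential decay
print(f"fixed rate:   mean |x| near the end = {fixed:.3f}")
print(f"decayed rate: mean |x| near the end = {decayed:.3f}")
```

Both runs start with the same rate, so both approach the minimum quickly. With the fixed rate, the noise keeps pushing x around by roughly the same amount forever; with the decayed rate, each noisy step moves x less and less, so x ends up much closer to zero.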

Key idea

A learning rate schedule starts large for speed and decays for precision, combining fast progress with a gentle landing.
