The single most important knob
The learning rate scales how far each gradient step moves the parameters. It is often the hyperparameter that most affects whether training succeeds.
- Too high: steps overshoot the valley, loss oscillates or diverges.
- Too low: progress is painfully slow and may stall in a plateau.
- Just right: steady, fast descent toward a minimum.
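The three regimes above can be reproduced on a toy problem. The sketch below is only an illustration: the loss f(x) = x**2, the function name gradient_descent, and the three rates are assumptions chosen for the demo, not values taken from any real training run.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimize f(x) = x**2 starting from x0; return the loss after each step."""
    x = x0
    losses = []
    for _ in range(steps):
        grad = 2.0 * x        # derivative of x**2
        x = x - lr * grad     # the learning rate scales the step
        losses.append(x * x)
    return losses


# Illustrative rates for this toy loss only; real models need their own sweep.
for name, lr in [("too high", 1.1), ("too low", 0.001), ("reasonable", 0.3)]:
    final_loss = gradient_descent(lr)[-1]
    print(f"{name:>10} (lr={lr}): loss after 20 steps = {final_loss:.3g}")
```

On this toy loss the largest rate makes the loss grow at every step, the smallest leaves it close to its starting value of 1, and the middle one drives it toward zero.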
What goes wrong
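With a rate that is too large, each step can overshoot the bottom of the loss valley and land on the far wall, higher than where it started, so the iterate bounces further up each time. The loss curve climbs or swings wildly. With a rate that is too small, the loss still decreases, but so slowly that the curve looks almost flat and training appears stalled.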
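A one-dimensional quadratic makes both failure modes concrete. This is a simplified worked example, not a statement about general loss surfaces; the curvature a, the rate \eta, and the iterate x_t are symbols introduced here for illustration.

```latex
f(x) = \tfrac{a}{2}\,x^{2}, \qquad a > 0
\\[4pt]
x_{t+1} = x_t - \eta\, f'(x_t) = (1 - \eta a)\, x_t
\\[4pt]
|x_t| \to 0 \iff |1 - \eta a| < 1 \iff 0 < \eta < \frac{2}{a}
```

When \eta > 2/a the factor (1 - \eta a) is below -1, so the iterate flips sides and grows in magnitude on every step, which is the bouncing described above. When \eta is tiny the factor sits just under 1, so each step shrinks x_t only slightly.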
Tuning strategies
- Sweep candidate rates on a log scale, for example values a factor of ten apart.
- Use a warmup: start with a small rate and raise it to the target value over the first steps.
- Decay the rate over time so steps shrink near a minimum.
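The three strategies above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the names candidate_rates and scheduled_rate, the warmup length, the total step count, and the choice of cosine decay (one common decay shape among several) are all illustrative, not prescribed by the text.

```python
import math

# Log-scale sweep: candidates a factor of ten apart (1e-5 up to 1e-1).
candidate_rates = [10.0 ** exponent for exponent in range(-5, 0)]


def scheduled_rate(step, base_lr, warmup_steps=100, total_steps=1000):
    """Linear warmup up to base_lr, then cosine decay toward zero.

    warmup_steps and total_steps are hypothetical defaults for this sketch.
    """
    if step < warmup_steps:
        # Warmup: grow linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay: fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, scheduled_rate(step, base_lr=0.01))
```

In practice one would run a short training job for each entry in candidate_rates and keep the rate whose loss falls fastest without becoming unstable, then apply the schedule to that base rate.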
Adaptive optimizers like Adam adjust an effective rate per parameter, but a sensible base rate still matters.
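One way to see why the base rate still matters is to look at Adam's first update. The sketch below implements the standard Adam update rule for a single step from zero-initialized moment estimates; the function name adam_first_step and the example gradients are assumptions made for illustration.

```python
import math


def adam_first_step(grads, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-parameter updates Adam applies on its first step (t = 1)."""
    t = 1
    updates = []
    for g in grads:
        m = (1 - beta1) * g            # first-moment estimate, starting from 0
        v = (1 - beta2) * g * g        # second-moment estimate, starting from 0
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# Gradients that differ by four orders of magnitude (hypothetical values).
print(adam_first_step([100.0, 0.01, -3.0]))
```

Because each gradient is divided by its own running magnitude, the first update has size close to lr for every parameter regardless of gradient scale. The normalization is per parameter, but the overall step size is still set by the base rate.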
Key idea
The learning rate controls step size: too large and training diverges, too small and it crawls. Tuning and scheduling the rate are therefore central to making gradient descent converge well.