Machine Learning

Learning Rate Warmup

Starting small and ramping up so early training does not explode.

The problem at the start

At the very beginning of training, the weights are random and the gradient estimates are unreliable. Adaptive optimizers like Adam also have noisy moment estimates in the first steps, because their running averages are built from only a handful of gradients. A large learning rate here can push the weights into a bad region or cause the loss to diverge.
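
To make the point about Adam concrete, here is a minimal sketch of its very first update for a single scalar parameter. It assumes the standard bias-corrected Adam rule with commonly used default hyperparameters. The function name `adam_first_step` is hypothetical and used only for illustration.

```python
import math

def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update for one scalar gradient (illustrative sketch)."""
    m = (1 - beta1) * grad        # first-moment estimate after one gradient
    v = (1 - beta2) * grad ** 2   # second-moment estimate after one gradient
    m_hat = m / (1 - beta1)       # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Gradients that differ by eight orders of magnitude
for g in (1e-4, 1.0, 1e4):
    print(g, adam_first_step(g))
```

With only one gradient seen, the moment estimates cancel and the step has magnitude close to `lr` whatever the gradient's scale. A single noisy gradient therefore moves the weights by roughly the full learning rate, which is one reason a small starting rate helps.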

What warmup does

Warmup starts the learning rate at a small value and increases it gradually over the first several hundred to several thousand steps until it reaches the target rate.

  • A linear warmup raises the rate in equal increments
  • After warmup the schedule usually decays, often with a cosine curve
  • The warmup length is a tunable number of steps
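
The linear case in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not a fixed convention: steps are 0-indexed, the rate reaches the target on the last warmup step, and the name `linear_warmup_lr` and its parameters are hypothetical.

```python
def linear_warmup_lr(step, target_lr, warmup_steps):
    """Learning rate at a 0-indexed step under linear warmup (illustrative sketch)."""
    # After warmup (or if warmup is disabled) use the target rate unchanged.
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    # During warmup, rise in equal increments of target_lr / warmup_steps.
    return target_lr * (step + 1) / warmup_steps

# Print the rate at a few points of a 1000-step warmup toward 1e-3
for s in (0, 249, 499, 999, 5000):
    print(s, linear_warmup_lr(s, target_lr=1e-3, warmup_steps=1000))
```

Starting at `target_lr / warmup_steps` rather than exactly zero is a design choice made here so that the first step still updates the weights; some implementations start from zero instead.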

Why it matters for big models

Large transformers and training runs with large batch sizes are especially sensitive. Without warmup, the early updates can destabilize layer normalization statistics and the attention weights, producing loss spikes from which the run may never recover.

A typical recipe

  • Warm up linearly for a few thousand steps
  • Hold or peak at the target rate
  • Decay slowly toward zero for the rest of training
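
One way to put the recipe above together is linear warmup followed by cosine decay. The sketch below assumes 0-indexed steps, no hold phase at the peak, and decay down to a floor `min_lr` that defaults to zero. The function `warmup_cosine_lr` and its parameter names are hypothetical.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (illustrative sketch)."""
    if step < warmup_steps:
        # Warmup phase: equal increments up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    # Decay phase: progress runs from 0 at the end of warmup to 1 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Sample the schedule for a 100k-step run with 2k warmup steps
for s in (0, 1999, 2000, 51000, 100000):
    print(s, warmup_cosine_lr(s, total_steps=100_000, warmup_steps=2_000, peak_lr=3e-4))
```

The step counts and peak rate in the loop are placeholder values, not recommendations. In practice a function like this is typically evaluated once per optimizer step and its result assigned as the current learning rate.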

Warmup is cheap insurance. It costs a little time early on and greatly improves the odds of a stable run.

Key idea

Warmup ramps the learning rate up from a small value so unstable early gradients do not derail training.

Check yourself

1. Why is the learning rate kept small at the very start of training?

2. What typically happens to the learning rate after warmup ends?