Learning Rate Warmup

The problem at the start

At the very beginning of training, the weights are random and the gradient estimates are unreliable. Adaptive optimizers such as Adam add a second problem: their running moment estimates are built from only a handful of gradients in the first steps, so the per-parameter step sizes they produce are noisy. A large learning rate at this stage can push the weights into a bad region or cause the loss to diverge.
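
To make the Adam point concrete, the sketch below computes the very first update for a single scalar parameter using the standard bias-corrected Adam rule. The function name adam_first_update and the default hyperparameters are illustrative assumptions, not taken from any particular library. It suggests that on step one the update magnitude is roughly the learning rate itself, whether the gradient is tiny or huge, because both moment estimates come from a single sample.

```python
import math

def adam_first_update(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update for one scalar parameter (illustrative helper)."""
    # Moment estimates after a single gradient, starting from zero.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    # Bias correction at step t = 1 divides by (1 - beta ** 1).
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    # The ratio is about sign(grad), so the step is about lr in magnitude.
    return lr * m_hat / (math.sqrt(v_hat) + eps)

small = adam_first_update(1e-3)   # tiny gradient
large = adam_first_update(1e3)    # huge gradient
```

Under these assumptions, both calls return a step of nearly the same size, which is one reason a large initial learning rate is risky with adaptive optimizers.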

What warmup does

Warmup starts the learning rate at a small value, often close to zero, and increases it gradually over the first several hundred to several thousand steps until it reaches the target rate.

A linear warmup raises the rate in equal increments
After warmup the schedule usually decays, often with a cosine curve
The warmup length is a tunable number of steps
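
A linear warmup as described above can be sketched in a few lines. The function name linear_warmup_lr and its parameters are hypothetical, and the choice to count steps from zero and reach the target on the last warmup step is one convention among several.

```python
def linear_warmup_lr(step, target_lr, warmup_steps):
    """Learning rate at a 0-indexed step under linear warmup (illustrative helper)."""
    # After warmup (or with warmup disabled) use the target rate unchanged.
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    # Equal increments: step 0 gives target_lr / warmup_steps,
    # and step warmup_steps - 1 gives exactly target_lr.
    return target_lr * (step + 1) / warmup_steps

# Example: the first few rates for a 1000-step warmup to 3e-4.
first_rates = [linear_warmup_lr(s, 3e-4, 1000) for s in range(5)]
```

The warmup length stays a single tunable argument, so it can be varied without touching the rest of the schedule.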

Why it matters for big models

Large transformers and large-batch training are especially sensitive. Without warmup, the early updates can destabilize layer normalization statistics and the attention weights, producing loss spikes from which the run may never recover.

A typical recipe

Warm up linearly for a few thousand steps
Reach the target rate, then either hold it for a while or start decaying right away
Decay slowly toward zero for the rest of training
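
The recipe above can be written as one function of the step number. This is a minimal sketch assuming a cosine decay after the optional hold phase; the name lr_at, the hold_steps and final_lr parameters, and the example values are assumptions for illustration, not a prescribed implementation.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, hold_steps=0, final_lr=0.0):
    """Warmup, optional hold, then cosine decay (illustrative helper, 0-indexed steps)."""
    # Phase 1: linear warmup in equal increments up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Phase 2: optionally hold the peak rate for hold_steps steps.
    decay_start = warmup_steps + hold_steps
    if step < decay_start:
        return peak_lr
    # Phase 3: cosine decay from peak_lr toward final_lr over the remaining steps.
    decay_steps = max(1, total_steps - decay_start)
    progress = min(1.0, (step - decay_start) / decay_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# Example: a 10,000-step run with a 2,000-step warmup to a peak of 3e-4.
schedule = [lr_at(s, 10_000, 3e-4, 2_000) for s in range(10_000)]
```

Setting hold_steps to zero gives the "peak, then decay" variant, while a positive value gives the "hold" variant. Keeping the schedule as a pure function of the step makes it easy to plot or to plug into whatever optimizer interface is in use.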

Warmup is cheap insurance. It costs a little time early on and greatly improves the odds of a stable run.

Key idea