What it is
Adam is one of the most widely used optimizers for deep learning. The name comes from adaptive moment estimation. It combines two ideas: momentum, which smooths the direction of updates, and per-parameter learning rates, which scale the step for each weight.
The two moments
Adam tracks two exponentially decaying running averages for every parameter.
- The first moment is an average of recent gradients, like momentum
- The second moment is an average of recent squared gradients
The update divides the smoothed gradient by the square root of the second moment, plus a small constant that prevents division by zero. Weights whose gradients are large or noisy get smaller effective learning rates, while weights whose gradients are small get larger ones.
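The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name adam_step, the flat-list representation of the weights, and the default values (lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8) are assumptions chosen for the example. It also includes the bias correction covered in the next section.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of floats. t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # First moment: running average of gradients (momentum-like).
        m_i = beta1 * m_i + (1 - beta1) * g
        # Second moment: running average of squared gradients.
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for averages that started at zero.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Smoothed gradient divided by the root of the second moment.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two weights whose gradients differ by a factor of 10,000.
params, m, v = adam_step([1.0, 1.0], [100.0, 0.01], [0.0, 0.0], [0.0, 0.0], t=1)
print(params)
```

On this first step both weights move by roughly lr, even though their gradients differ by four orders of magnitude. That is the per-parameter scaling at work: the gradient's magnitude is divided out, and mostly its sign and consistency remain.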
Bias correction
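Both averages start at zero, so early in training they are biased toward zero. Adam corrects for this by dividing each average by one minus its decay rate raised to the power of the step count. The factor scales the early estimates up to their proper size and fades to 1 as training continues, so the first few steps are properly scaled.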
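The bias is easiest to see by feeding a running average the same value at every step. The sketch below assumes a decay rate of 0.9 and a constant input of 1.0. The helper name ema_of_constant is hypothetical, and the divisor 1 - beta ** t is the correction used in the standard formulation of Adam.

```python
def ema_of_constant(g, beta, steps):
    """Running average started at zero and fed the same value g every step.

    Returns a list of (step, raw_average, corrected_average) tuples.
    """
    avg = 0.0
    history = []
    for t in range(1, steps + 1):
        avg = beta * avg + (1 - beta) * g
        corrected = avg / (1 - beta ** t)  # undo the pull toward zero
        history.append((t, avg, corrected))
    return history

history = ema_of_constant(g=1.0, beta=0.9, steps=5)
for t, raw, corrected in history:
    print(t, round(raw, 4), round(corrected, 4))
```

The raw average starts near 0.1 and climbs slowly toward 1.0, while the corrected value recovers the true input from the first step onward.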
Why people like it
- It works well with little tuning
- It handles sparse and noisy gradients
- It often reaches a good solution in fewer steps than plain gradient descent
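A common variant called AdamW decouples weight decay from the gradient update: the decay shrinks the weights directly instead of being added to the gradient and rescaled by the second moment. It is now standard for training large language models.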
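The difference between the two ways of applying weight decay can be sketched for a single scalar weight. This is an illustration under assumptions: the function names, the shared helper _adam_direction, and the chosen values (lr=0.1, wd=0.01, a gradient of zero) are made up for the example and are not taken from any library.

```python
import math

def _adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam step direction and the updated moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(p, g, m, v, t, lr, wd):
    """Adam with an L2 penalty: the decay term is folded into the gradient,
    so it is rescaled by the second moment like everything else."""
    direction, m, v = _adam_direction(g + wd * p, m, v, t)
    return p - lr * direction, m, v

def adamw_step(p, g, m, v, t, lr, wd):
    """AdamW-style step: the decay bypasses the moments and shrinks the
    weight directly, in proportion to its current value."""
    direction, m, v = _adam_direction(g, m, v, t)
    return p - lr * direction - lr * wd * p, m, v

# A weight of 2.0 with zero loss gradient, so only the decay acts on it.
p_l2, _, _ = adam_l2_step(2.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.01)
p_w, _, _ = adamw_step(2.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.01)
print(p_l2, p_w)
```

With the L2 form, the small decay term is normalized into a full step of size lr, so the weight drops to about 1.9 regardless of how small wd is. With the decoupled form, the weight shrinks by lr * wd * p, landing at about 1.998.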
Key idea
Adam adapts a separate learning rate for each weight using running averages of the gradient and its square.