The Adam Optimizer

What it is

Adam is one of the most widely used optimizers for deep learning. The name comes from adaptive moment estimation. It combines two ideas: momentum, which smooths the direction of updates, and per-parameter learning rates, which scale the step size for each weight individually.

The two moments

Adam tracks two exponentially decaying running averages for every parameter.

The first moment is an average of recent gradients, like momentum
The second moment is an average of recent squared gradients

The update divides the smoothed gradient by the square root of the second moment, plus a small constant that prevents division by zero. Weights with large or noisy gradients get smaller steps, while weights with small, consistent gradients get relatively larger steps.
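
The two averages and the scaled step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the decay rates beta1 = 0.9 and beta2 = 0.999, the learning rate, and the constant eps are assumptions chosen for the example, and bias correction is left out to keep the focus on the two moments.

```python
import math

def update_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """Update the running averages of the gradient and the squared gradient."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment
    return m, v

def scaled_step(m, v, lr=1e-3, eps=1e-8):
    """Smoothed gradient divided by the root of the second moment."""
    return lr * m / (math.sqrt(v) + eps)

def final_step(grads):
    """Feed a sequence of gradients to one parameter and return its last step."""
    m, v = 0.0, 0.0
    for g in grads:
        m, v = update_moments(m, v, g)
    return scaled_step(m, v)

# Hypothetical gradient histories for three separate parameters.
steady_large = final_step([10.0] * 100)    # consistently large gradient
steady_small = final_step([0.01] * 100)    # consistently small gradient
noisy = final_step([1.0, -1.0] * 50)       # gradient that keeps flipping sign
```

In this sketch the two steady parameters end up with almost the same step even though their gradients differ by a factor of 1000, and the parameter with the sign-flipping gradient takes a much smaller step because its first moment averages out close to zero.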

Bias correction

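Both averages start at zero, so early in training they are biased toward zero. Adam corrects for this by dividing each average by one minus its decay rate raised to the power of the step count. The correction scales the estimates up, matters most in the first few steps, and fades as training proceeds. With typical settings, where the second moment decays more slowly than the first, skipping the correction makes the earliest steps too large.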
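
A small worked example may make the correction concrete. The sketch below looks only at the very first step, assuming decay rates of 0.9 and 0.999 and a made-up gradient of 0.5; the helper name bias_correct is hypothetical.

```python
import math

def bias_correct(avg, beta, t):
    """Divide a running average by 1 - beta**t, where t is the 1-based step count."""
    return avg / (1 - beta ** t)

beta1, beta2 = 0.9, 0.999   # assumed decay rates
grad = 0.5                  # made-up gradient at step 1

# Both averages start at zero, so after one update they hold only a
# small fraction of the gradient and of the squared gradient.
m = (1 - beta1) * grad
v = (1 - beta2) * grad * grad

m_hat = bias_correct(m, beta1, t=1)
v_hat = bias_correct(v, beta2, t=1)

raw_ratio = m / math.sqrt(v)                  # what the update would use without correction
corrected_ratio = m_hat / math.sqrt(v_hat)    # what the update uses with correction
```

Under these assumptions the corrected averages recover the gradient and its square, so the corrected ratio is 1 and the first step has a magnitude close to the learning rate. The uncorrected ratio comes out at about 3.16.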

Why people like it

It works well with little tuning
It handles sparse and noisy gradients
It often converges in fewer steps than plain stochastic gradient descent

A common variant called AdamW decouples weight decay from the gradient update: the decay is applied directly to the weights instead of being added to the gradient. It is now standard for training large language models.
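
The difference between decoupled and coupled weight decay can be sketched for a single scalar weight. The function below is an illustration under assumed hyperparameters, not the code of any particular library; the decoupled flag and all default values are choices made for this example.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01, decoupled=True):
    """One update for a scalar weight, with decoupled or coupled weight decay."""
    if not decoupled:
        # Coupled (L2-style) decay: the decay term passes through the moments.
        grad = grad + weight_decay * param

    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    if decoupled:
        # Decoupled (AdamW-style) decay: shrink the weight directly.
        param = param - lr * weight_decay * param

    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# With a zero gradient, only the weight decay moves the weight.
w_decoupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
w_coupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
```

In this example the decoupled version shrinks the weight by the factor 1 - lr * weight_decay, while the coupled version sends the decay term through the adaptive scaling and moves the weight by roughly the full learning rate.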

Key idea