Machine Learning

The Adam Optimizer

The default optimizer that adapts each weight's step size on the fly.

4 min read · intro

What it is

Adam is one of the most widely used optimizers for deep learning. The name comes from adaptive moment estimation. It combines two ideas: momentum, which smooths the direction of updates, and per-parameter learning rates, which scale the step size for each weight individually.

The two moments

Adam tracks two running averages for every parameter.

  • The first moment is an exponentially weighted average of recent gradients, like momentum
  • The second moment is an exponentially weighted average of recent squared gradients

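The update divides the smoothed gradient (the first moment) by the square root of the second moment, plus a small constant that prevents division by zero. Weights whose gradients are large or noisy take smaller steps relative to their gradient size, while weights with small, consistent gradients take relatively larger ones.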
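
The update can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name adam_step and the list-based layout are made up for this lesson, and the defaults (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly quoted ones, assumed here rather than taken from the text above.

```python
import math

def adam_step(params, grads, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of floats. t is the step count, starting at 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # First moment: running average of gradients (momentum-like).
        m_i = beta1 * m_i + (1 - beta1) * g
        # Second moment: running average of squared gradients.
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction (see the "Bias correction" section).
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Smoothed gradient divided by the root of the second moment.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two weights whose gradients differ by a factor of 10,000.
params = [1.0, 1.0]
grads = [0.01, 100.0]
m, v = [0.0, 0.0], [0.0, 0.0]

new_params, m, v = adam_step(params, grads, m, v, t=1)
steps = [old - new for old, new in zip(params, new_params)]
print(steps)  # both steps come out close to lr, despite the gap in gradient size
```

On the first step the gradient scale cancels out, so each weight moves by roughly the learning rate. Plain gradient descent would instead move the second weight 10,000 times further than the first.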

Bias correction

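Both averages start at zero, so early in training they are biased toward zero. Adam divides each average by a correction factor that scales the estimates back up during the first steps and fades toward 1 as training proceeds, so the early updates are properly scaled rather than distorted by the zero initialization.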
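
A small numeric sketch makes the bias visible. It assumes a constant gradient of 1.0 and a decay rate beta of 0.9 (a common choice for the first moment), and divides by 1 - beta**t, the correction used in the standard formulation of Adam. The helper name running_average is hypothetical.

```python
def running_average(grads, beta):
    """Return the raw and bias-corrected running averages after each step."""
    avg = 0.0  # starts at zero, which is the source of the bias
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        avg = beta * avg + (1 - beta) * g
        raw.append(avg)
        # Dividing by (1 - beta**t) undoes the pull toward zero.
        corrected.append(avg / (1 - beta ** t))
    return raw, corrected

raw, corrected = running_average([1.0] * 5, beta=0.9)
for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(t, round(r, 4), round(c, 4))
```

The true average of the gradients is 1.0 throughout. The raw estimate starts near 0.1 and is still only about 0.41 after five steps, while the corrected estimate stays at 1.0 from the first step.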

Why people like it

  • It works well with little tuning
  • It handles sparse and noisy gradients
  • It often converges faster than plain stochastic gradient descent

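A common variant called AdamW decouples weight decay from the gradient update: the decay is applied directly to the weights instead of being added to the gradient, so it is not rescaled by the second moment. AdamW is now standard for training large language models.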
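
The difference can be sketched for a single weight. This is a simplified illustration under assumed hyperparameters (lr=0.001, wd=0.01, and the usual beta and eps values), with hypothetical function names. It is not the code of any particular library.

```python
import math

def _adaptive_step(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Shared Adam core: update both moments and return the scaled gradient."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(p, g, m, v, t, lr=0.001, wd=0.01):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    update, m, v = _adaptive_step(g + wd * p, m, v, t)
    return p - lr * update, m, v

def adamw_step(p, g, m, v, t, lr=0.001, wd=0.01):
    """AdamW: the decay acts on the weight directly, outside the adaptive scaling."""
    update, m, v = _adaptive_step(g, m, v, t)
    return p - lr * update - lr * wd * p, m, v

# A weight of 10.0 with zero loss gradient, so only the decay term is active.
p_l2, _, _ = adam_l2_step(10.0, 0.0, 0.0, 0.0, t=1)
p_w, _, _ = adamw_step(10.0, 0.0, 0.0, 0.0, t=1)
print(p_l2, p_w)
```

With L2 folded into the gradient, the decay is normalized like any other gradient, so the weight moves by about lr regardless of the decay strength. With AdamW the weight shrinks by lr * wd * p, so the decay strength keeps its intended meaning.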

Key idea

Adam adapts a separate learning rate for each weight using running averages of the gradient and its square.

Check yourself

1. What two quantities does Adam track for each parameter?

2. Why does Adam apply bias correction?

3. What does the AdamW variant change?