Adam and AdamW

The default optimizer and its weight decay correction.

Adam is one of the most widely used optimizers in deep learning because it combines momentum with per-parameter step scaling, which tends to give fast, robust convergence with little tuning.

What Adam tracks

  • A first moment, the exponential moving average of gradients, acting like momentum.
  • A second moment, the exponential moving average of squared gradients, acting like RMSProp.
  • A bias correction, so early estimates are not skewed toward zero by the zero initialization of both averages.
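
The three tracked quantities can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `update_moments` is hypothetical, and the decay rates `beta1=0.9` and `beta2=0.999` are the commonly used defaults, assumed here.

```python
import numpy as np

def update_moments(m, v, grad, t, beta1=0.9, beta2=0.999):
    """Update Adam's moment estimates; t is the 1-based step count."""
    # First moment: exponential moving average of gradients (momentum-like).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (RMSProp-like).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages start at zero, so early values are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m, v, m_hat, v_hat

grad = np.array([0.5, -2.0])
m, v, m_hat, v_hat = update_moments(np.zeros(2), np.zeros(2), grad, t=1)
print(m)      # raw first moment after one step, shrunk toward zero
print(m_hat)  # bias-corrected first moment, equal to the gradient at t=1
```

At the first step the raw average is only a tenth of the gradient, while the corrected value recovers the gradient itself. That gap is what the bias correction removes.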

How it updates

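Adam divides the bias-corrected first moment by the square root of the bias-corrected second moment (plus a small constant for numerical stability), then scales the result by the learning rate. This gives each weight an adaptive step that already carries inertia: weights with consistently large gradients take proportionally smaller steps, and weights with small gradients take proportionally larger ones. The result is a method that works well across many architectures with near-default settings, which is why it is the common starting choice.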
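
A single update can be sketched as follows. This is a minimal NumPy version under assumed defaults (`lr=1e-3`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the name `adam_step` and the toy loss are illustrative only.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # smoothed gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # smoothed squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy loss f(w) = sum(w**2), whose gradient is 2 * w.
w0 = np.array([1.0, -100.0])
w1, m, v = adam_step(w0, 2 * w0, np.zeros_like(w0), np.zeros_like(w0), t=1)
print(w0 - w1)  # per-weight step sizes after the first update
```

Although the two gradients differ by a factor of 100, both weights move by roughly the learning rate on the first step. The division by the square root of the second moment is what normalizes the step per weight.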

The AdamW fix

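When weight decay is implemented in Adam as an L2 penalty, the decay term is added to the gradient, so it passes through the adaptive scaling along with everything else. Weights with a large gradient history are then decayed less than weights with a small one, and decay does not act uniformly. AdamW decouples weight decay, applying it directly to the weights as a separate shrink step outside the adaptive update. This restores the intended regularization and often generalizes better, which has made AdamW a common default for training large models.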
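
The difference between the two variants comes down to where the decay term enters. The sketch below places them side by side; the function names, the hyperparameter values, and the toy inputs are assumptions chosen for illustration, not a reference implementation.

```python
import numpy as np

def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected, per-weight scaled direction and new moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, grad, m, v, t, lr=1e-3, weight_decay=1e-2):
    # Coupled: the decay term joins the gradient and is rescaled with it.
    direction, m, v = adam_direction(grad + weight_decay * w, m, v, t)
    return w - lr * direction, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, weight_decay=1e-2):
    # Decoupled: the adaptive direction uses the loss gradient only.
    direction, m, v = adam_direction(grad, m, v, t)
    w = w - lr * weight_decay * w  # separate shrink step
    return w - lr * direction, m, v

w = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])  # same weight value, very different gradients
zeros = np.zeros_like(w)
w_l2, _, _ = adam_l2_step(w, grad, zeros, zeros, t=1)
w_adamw, _, _ = adamw_step(w, grad, zeros, zeros, t=1)
print(w_l2)     # decay is absorbed into the normalized step
print(w_adamw)  # both weights also shrink by lr * weight_decay * w
```

In the coupled version the decay term changes the gradient but is then normalized away, so it barely affects the step. In the decoupled version both weights shrink by the same relative amount regardless of their gradient scale. PyTorch ships the decoupled form as `torch.optim.AdamW`.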

Key idea

Adam blends momentum and per-parameter scaling with bias correction, and AdamW decouples weight decay from the adaptive update for cleaner regularization.

Check yourself

1. Which two ideas does Adam combine?

2. What problem does AdamW fix?

3. Why does Adam apply bias correction?