Adam and AdamW

Adam is one of the most widely used optimizers because it combines momentum with per-parameter scaling, which tends to give fast, robust convergence with little tuning.

What Adam tracks

A first moment, the exponential moving average of gradients, acting like momentum.
A second moment, the exponential moving average of squared gradients, acting like RMSProp.
A bias correction, so early estimates are not skewed toward zero (both averages start at zero).
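
The three quantities above can be written out as a short set of equations. This is a sketch in the usual textbook notation, which the text itself does not define: g_t is the gradient at step t, m_t and v_t are the first and second moments, and beta_1, beta_2 are the decay rates of the two averages (all symbol names are assumptions for illustration).

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, & m_0 &= 0 \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2}, & v_0 &= 0 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{t}}, & \hat{v}_t &= \frac{v_t}{1 - \beta_2^{t}}
\end{aligned}
```

At t = 1 the raw average m_1 equals (1 - beta_1) g_1, which is much smaller than the gradient itself; dividing by 1 - beta_1 recovers g_1, which is the skew toward zero that the correction removes.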

How it updates

Adam divides the bias-corrected smoothed gradient by the square root of the bias-corrected smoothed squared gradient, then scales the result by the learning rate. Each weight therefore gets its own adaptive step size, and the step already carries momentum. In practice the method works well across many architectures with default settings, which is why it is a common starting choice.
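
The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name adam_step, its argument names, and the default values (lr, beta1, beta2, eps) are assumptions chosen to match commonly quoted defaults.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of weights (hypothetical helper).

    w, g: weights and their gradients
    m, v: running first and second moments (start at zero)
    t:    1-based step count, used for bias correction
    Returns the new (w, m, v) as lists.
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # smoothed gradient
        vi = beta2 * vi + (1 - beta2) * gi * gi   # smoothed squared gradient
        m_hat = mi / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = vi / (1 - beta2 ** t)             # bias-corrected second moment
        wi = wi - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_w.append(wi)
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v

# Two weights whose gradients differ by four orders of magnitude.
w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = adam_step(w, [100.0, 0.01], m, v, t=1)
print(w)
```

Both weights move by roughly the learning rate on this first step even though their gradients differ enormously, which is the per-parameter scaling at work.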

The AdamW fix

Standard Adam folds weight decay into the gradient as an L2 penalty term. The adaptive scaling then divides that term too, so weights with a history of large gradients are decayed less than weights with a history of small gradients, and decay does not act uniformly. AdamW decouples weight decay, applying it directly to the weights as a separate shrink step. This restores uniform regularization and usually generalizes better, which has made AdamW a common default for training large models.
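
The difference between the two schemes can be made concrete with a small sketch for a single scalar weight. The function names step and decay_effect, the decoupled flag, and the hyperparameter values are hypothetical and chosen only for illustration.

```python
import math

def step(w, grad, m, v, t, lr=1e-3, wd=1e-2, decoupled=True,
         beta1=0.9, beta2=0.999, eps=1e-8):
    """One update for a scalar weight (hypothetical helper).

    decoupled=False: decay is added to the gradient (Adam with L2 penalty).
    decoupled=True:  decay shrinks the weight directly (AdamW style).
    """
    if not decoupled:
        grad = grad + wd * w          # decay enters the gradient and gets rescaled
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    if decoupled:
        w = w - lr * wd * w           # separate shrink step, untouched by scaling
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def decay_effect(grad, decoupled):
    """Extra shrinkage on the first step that is attributable to weight decay."""
    no_decay, _, _ = step(1.0, grad, 0.0, 0.0, 1, wd=0.0, decoupled=decoupled)
    with_decay, _, _ = step(1.0, grad, 0.0, 0.0, 1, wd=1e-2, decoupled=decoupled)
    return no_decay - with_decay

for grad in (100.0, 0.01):
    print(grad, decay_effect(grad, decoupled=False), decay_effect(grad, decoupled=True))
```

In this toy setting the decoupled version shrinks the weight by lr * wd * w regardless of gradient size, while the folded-in version lets the adaptive normalization absorb almost all of the decay on the first step.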

Key idea

Adam blends momentum and per-parameter scaling with bias correction, and AdamW decouples weight decay for cleaner regularization.
