Adam and AdamW
Adam is one of the most widely used optimizers in deep learning because it combines momentum with per-parameter step scaling, which tends to give fast, robust convergence with little tuning.
What Adam tracks
- A first moment: an exponential moving average of the gradients, acting like momentum.
- A second moment: an exponential moving average of the squared gradients, acting like RMSProp.
- A bias correction: both averages start at zero, so early estimates are rescaled to keep them from being skewed toward zero.
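The three tracked quantities can be sketched for a single scalar gradient. This is a minimal illustration, not a library implementation: the function name moment_estimates is hypothetical, and the decay rates beta1=0.9 and beta2=0.999 are the commonly cited defaults, assumed here.

```python
def moment_estimates(grads, beta1=0.9, beta2=0.999):
    """Track raw and bias-corrected moments for a sequence of scalar gradients.

    Returns one tuple (m, m_hat, v, v_hat) per step.
    beta1 and beta2 are assumed defaults, not values taken from the text.
    """
    m, v = 0.0, 0.0  # both averages start at zero
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g       # first moment: average of gradients
        v = beta2 * v + (1 - beta2) * g * g   # second moment: average of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction undoes the zero start
        v_hat = v / (1 - beta2 ** t)
        history.append((m, m_hat, v, v_hat))
    return history


# A constant gradient of 1.0 makes the bias visible.
history = moment_estimates([1.0, 1.0, 1.0])
for step, (m, m_hat, v, v_hat) in enumerate(history, start=1):
    print(step, m, m_hat, v, v_hat)
```

With a constant gradient, the raw first moment starts near 0.1 and only slowly climbs toward 1.0, while the corrected estimate sits at about 1.0 from the first step.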
How it updates
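Adam divides the bias-corrected smoothed gradient by the square root of the bias-corrected smoothed squared gradient (plus a small constant for numerical stability), then scales the result by the learning rate. This gives each weight an adaptive step that already carries inertia: weights with consistently large gradients take proportionally smaller steps, and weights with small gradients take larger ones. The result is a method that works well across many architectures with default settings, which is why it is a common starting choice.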
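The update can be sketched in plain Python over lists of floats. The function name adam_step and the hyperparameter values (lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8) are assumptions chosen for illustration; real frameworks apply the same arithmetic to tensors.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of floats; returns new (params, m, v).

    t is the 1-based step count. All hyperparameter defaults are assumed.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # smoothed gradient
        v_i = beta2 * v_i + (1 - beta2) * g * g    # smoothed squared gradient
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected estimates
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Two weights whose gradients differ by a factor of 1000.
params, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
params, m, v = adam_step(params, [0.01, 10.0], m, v, t=1)
steps = [1.0 - p for p in params]
print(steps)
```

Both weights move by roughly the learning rate even though their gradients differ by three orders of magnitude, which is the per-parameter scaling at work.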
The AdamW fix
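Standard Adam is usually paired with L2 regularization, which folds weight decay into the gradient. The adaptive scaling then divides the decay term along with everything else, so decay does not act uniformly: weights with a history of large gradients are shrunk less than weights with small gradients. AdamW decouples weight decay, applying it directly to the weights as a separate shrink step outside the adaptive scaling. This restores the intended regularization and often generalizes better, which has made AdamW a common default for training large models.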
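The difference can be sketched on a single scalar weight. The names adam_direction, adam_l2_step, adamw_step and decay_effect are hypothetical helpers, the hyperparameters are assumed defaults, and scaling the decoupled decay by the learning rate follows one common convention rather than the only possible one.

```python
import math


def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the normalized Adam direction plus updated moments (assumed defaults)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=1e-2):
    """Adam with L2 regularization: the decay term passes through the adaptive scaling."""
    direction, m, v = adam_direction(g + wd * w, m, v, t)
    return w - lr * direction, m, v


def adamw_step(w, g, m, v, t, lr=1e-3, wd=1e-2):
    """AdamW: the decay is a separate shrink applied directly to the weight."""
    direction, m, v = adam_direction(g, m, v, t)
    return w - lr * direction - lr * wd * w, m, v


def decay_effect(step_fn, w, g, wd):
    """How much extra one first step shrinks w because of weight decay."""
    with_decay, _, _ = step_fn(w, g, 0.0, 0.0, 1, wd=wd)
    without_decay, _, _ = step_fn(w, g, 0.0, 0.0, 1, wd=0.0)
    return without_decay - with_decay


for g in (0.01, 10.0):
    print(g, decay_effect(adam_l2_step, 1.0, g, 1e-2), decay_effect(adamw_step, 1.0, g, 1e-2))
```

In this sketch the AdamW shrink is about lr * wd * w (1e-5 here) regardless of the gradient, while the L2 variant's decay is orders of magnitude smaller on the first step because the normalization absorbs it.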
Key idea
Adam blends momentum and per-parameter scaling with bias correction, and AdamW decouples weight decay from the adaptive update for cleaner regularization.