Machine Learning

AdaGrad And Adaptive Gradients

Per-feature learning rates that shrink as squared gradients accumulate.

The idea

AdaGrad is an early adaptive optimizer that gives every parameter its own learning rate, based on the size of the gradients that parameter has received so far. Parameters that rarely receive gradients keep a large rate, while heavily updated ones slow down.

The mechanism

AdaGrad accumulates the sum of squared gradients for each parameter across all of training:

  • Each update is the base learning rate times the raw gradient, divided by the square root of this running total (plus a small constant to avoid dividing by zero)
  • Frequently active parameters build a large total and so take ever smaller steps
  • Sparse parameters, common in text and recommendation data, stay nimble

This per-feature behavior makes AdaGrad strong on sparse problems where most features rarely fire.
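
The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `adagrad_step`, the base rate `lr=0.1`, the constant `eps=1e-8`, and the toy gradients (one parameter active every step, one active every tenth step) are all assumptions made for the example.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update in place; return the step taken per parameter."""
    steps = []
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients for parameter i
        step = lr * g / (math.sqrt(accum[i]) + eps)
        params[i] -= step
        steps.append(step)
    return steps

# Hypothetical toy setup: parameter 0 sees a gradient on every step (dense),
# parameter 1 sees one only on every 10th step (sparse).
params = [0.0, 0.0]
accum = [0.0, 0.0]
dense_steps, sparse_steps = [], []
for t in range(100):
    grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
    steps = adagrad_step(params, grads, accum)
    dense_steps.append(steps[0])
    if grads[1] != 0.0:
        sparse_steps.append(steps[1])

print("last dense step: ", dense_steps[-1])
print("last sparse step:", sparse_steps[-1])
```

With identical gradient magnitudes, the sparse parameter ends up taking a larger step than the dense one, because its running total has grown more slowly.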

The catch

Because the accumulated sum only grows, the effective learning rate only ever shrinks, and training can stall before reaching a good solution. RMSProp and Adam fix this by replacing the unbounded sum with a decaying moving average of squared gradients, so old gradients are gradually forgotten.
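
To make the difference concrete, the sketch below tracks the effective learning rate under both accumulators when the gradient is held constant. It is an illustration under stated assumptions: the helper `effective_rates`, the decay factor `0.9`, the base rate `0.1`, and the constant gradient of `1.0` are chosen for the example, and the RMSProp-style average here omits extras such as Adam's momentum and bias correction.

```python
import math

def effective_rates(num_steps, lr=0.1, decay=0.9, eps=1e-8, grad=1.0):
    """Return per-step effective learning rates for an unbounded sum
    (AdaGrad-style) and a decaying average (RMSProp-style)."""
    adagrad_sum = 0.0
    rms_avg = 0.0
    adagrad_rates, rmsprop_rates = [], []
    for _ in range(num_steps):
        sq = grad * grad
        adagrad_sum += sq                               # grows without bound
        rms_avg = decay * rms_avg + (1.0 - decay) * sq  # forgets old gradients
        adagrad_rates.append(lr / (math.sqrt(adagrad_sum) + eps))
        rmsprop_rates.append(lr / (math.sqrt(rms_avg) + eps))
    return adagrad_rates, rmsprop_rates

ada, rms = effective_rates(1000)
print("AdaGrad-style rate after 1000 steps:", ada[-1])
print("RMSProp-style rate after 1000 steps:", rms[-1])
```

In this toy run the sum-based rate keeps falling roughly as one over the square root of the step count, while the average-based rate levels off near the base rate.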

Key idea

AdaGrad scales each parameter's step by the inverse square root of its accumulated squared gradients, helping sparse features but eventually shrinking learning rates too far.

Check yourself

1. What weakness of AdaGrad do RMSProp and Adam fix?

2. AdaGrad is especially well suited to what kind of data?