The idea
AdaGrad was an early adaptive optimizer that gives every parameter its own learning rate, based on the magnitude of the gradients that parameter has received so far. Parameters that are rarely updated keep a large rate, while heavily updated ones slow down.
The mechanism
AdaGrad accumulates the sum of squared gradients for each parameter across all of training:
- Each update scales the base learning rate times the raw gradient by one over the square root of this running total (plus a small constant to avoid division by zero)
- Frequently active parameters build a large total and so take ever smaller steps
- Sparse parameters, common in text and recommendation data, stay nimble
This per-feature behavior makes AdaGrad strong on sparse problems where most features rarely fire.
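The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `adagrad_step` and `effective_lr`, the base rate `lr=0.1`, the constant `eps=1e-8`, and the toy gradient schedule are all assumptions chosen for the example.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update in place to params and accum."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

def effective_lr(accum, lr=0.1, eps=1e-8):
    """Per-parameter step scale implied by the current accumulators."""
    return [lr / (math.sqrt(a) + eps) for a in accum]

# Hypothetical toy setup: parameter 0 receives a gradient on every step,
# parameter 1 only on every 10th step (a "sparse" feature).
params = [0.0, 0.0]
accum = [0.0, 0.0]
for t in range(1, 101):
    grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
    adagrad_step(params, grads, accum)

rates = effective_lr(accum)
print(accum)  # accumulated squared gradients per parameter
print(rates)  # the sparse parameter keeps the larger rate
```

After 100 steps the dense parameter has accumulated ten times as much squared gradient as the sparse one, so its effective rate is roughly 0.01 against roughly 0.032 for the sparse parameter.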
The catch
Because the accumulated sum only grows, the effective learning rate keeps shrinking, and training can stall before reaching a good solution. RMSProp and Adam address this by replacing the unbounded sum with an exponentially decaying moving average, so old gradients are gradually forgotten.
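To make the difference concrete, the sketch below tracks the effective learning rate under a constant gradient for both accumulation schemes. It is an illustration under stated assumptions: the function name `effective_rates` is hypothetical, `decay=0.9` is just a typical choice, and the RMSProp-style branch shows only the decaying average, without the momentum and bias correction that Adam adds.

```python
import math

def effective_rates(num_steps, grad=1.0, lr=0.1, decay=0.9, eps=1e-8):
    """Track the effective learning rate under a constant gradient.

    Returns (adagrad_rates, rmsprop_rates), one entry per step.
    """
    total = 0.0    # AdaGrad: unbounded sum of squared gradients
    average = 0.0  # RMSProp-style: exponentially decaying average
    adagrad_rates, rmsprop_rates = [], []
    for _ in range(num_steps):
        total += grad * grad
        average = decay * average + (1.0 - decay) * grad * grad
        adagrad_rates.append(lr / (math.sqrt(total) + eps))
        rmsprop_rates.append(lr / (math.sqrt(average) + eps))
    return adagrad_rates, rmsprop_rates

adagrad_rates, rmsprop_rates = effective_rates(1000)
print(adagrad_rates[0], adagrad_rates[-1])  # keeps falling as the sum grows
print(rmsprop_rates[0], rmsprop_rates[-1])  # settles near the base rate
```

With a constant gradient of 1, the AdaGrad rate decays like one over the square root of the step count and never stops falling, while the decaying average converges to the squared gradient itself, so the RMSProp-style rate levels off close to `lr`.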
Key idea
AdaGrad scales steps by the accumulated squared gradient per parameter, helping sparse features but eventually shrinking learning rates too far.