AdaGrad And Adaptive Gradients

The idea

AdaGrad was an early adaptive optimizer that gives every parameter its own learning rate, scaled by the size of the gradients that parameter has received so far. Parameters that rarely receive a gradient keep a large rate, while heavily updated ones slow down.

The mechanism

AdaGrad accumulates the sum of squared gradients for each parameter across all of training. That running total drives every update:

It divides each parameter's gradient by the square root of that parameter's running total, plus a small constant for numerical stability.
Frequently active parameters build a large total and so take ever smaller steps.
Sparse parameters, common in text and recommendation data, build a small total and stay nimble.

This per-feature behavior makes AdaGrad strong on sparse problems where most features rarely fire.
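
The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `adagrad_step`, the learning rate of 0.1, the stability constant `eps`, and the toy gradients (one feature that fires every step, one that fires every tenth step) are all assumptions chosen for the example.

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; returns the new parameters and accumulator."""
    accum = accum + grads ** 2                          # running sum of squared gradients
    params = params - lr * grads / (np.sqrt(accum) + eps)
    return params, accum

# Hypothetical toy setup: feature 0 fires on every step,
# feature 1 fires only on every 10th step.
params = np.zeros(2)
accum = np.zeros(2)
for step in range(100):
    grads = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    params, accum = adagrad_step(params, grads, accum)

# Per-parameter step scale after training: lr / sqrt(accumulated total).
effective_lr = 0.1 / (np.sqrt(accum) + 1e-8)
print(accum, effective_lr)
```

In this toy run the frequent feature ends with the larger accumulated total and therefore the smaller effective learning rate, while the rare feature keeps a larger one.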

The catch

Because the accumulated sum only grows, the effective learning rate keeps shrinking, and training can stall before reaching a good solution. RMSProp and Adam address this by replacing the unbounded sum with an exponentially decaying moving average of squared gradients, so old gradients are gradually forgotten.
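
The difference between the two accumulators can be seen with a constant gradient. The sketch below is an illustration under stated assumptions: the helper names `adagrad_accum` and `rmsprop_accum`, the decay of 0.9, and the constant gradient of 1.0 are chosen for the example, and it shows only the RMSProp-style average, not Adam's additional momentum and bias correction.

```python
import numpy as np

def adagrad_accum(accum, grad):
    """Unbounded sum of squared gradients (AdaGrad)."""
    return accum + grad ** 2

def rmsprop_accum(avg, grad, decay=0.9):
    """Exponentially decaying average of squared gradients (RMSProp-style)."""
    return decay * avg + (1.0 - decay) * grad ** 2

lr, eps = 0.1, 1e-8      # assumed values for illustration
ada, rms = 0.0, 0.0
for _ in range(1000):
    grad = 1.0           # constant gradient, so only the accumulators differ
    ada = adagrad_accum(ada, grad)
    rms = rmsprop_accum(rms, grad)

ada_rate = lr / (np.sqrt(ada) + eps)   # keeps shrinking as the sum grows
rms_rate = lr / (np.sqrt(rms) + eps)   # levels off as the average settles
print(ada_rate, rms_rate)
```

With the same gradient at every step, the AdaGrad total grows without bound and its step scale keeps falling, while the decaying average settles near the squared gradient and the step scale stays close to the base rate.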

Key idea

AdaGrad scales each parameter's step by the inverse square root of its accumulated squared gradients, helping sparse features but eventually shrinking learning rates too far.
