RMSProp
Different parameters often need different step sizes, yet a single global learning rate gives every parameter the same one. RMSProp adapts the step for each parameter using the recent magnitude of its gradients.
The mechanism
- Maintain a moving average of the squared gradient for each weight.
- Divide each gradient by the square root of that average before stepping.
- Add a tiny constant to avoid dividing by zero.
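The three steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name rmsprop_step and the defaults lr=0.01, decay=0.9, and eps=1e-8 are assumptions chosen for the example, not values given in the text.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """Apply one RMSProp update (hypothetical helper; defaults are assumed)."""
    # Step 1: moving average of the squared gradient, one entry per weight.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    # Steps 2 and 3: divide by the root of the average, with eps guarding
    # against division by zero.
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

w = np.zeros(2)
avg_sq = np.zeros(2)
grad = np.array([100.0, 0.01])  # two weights with very different gradient scales
w, avg_sq = rmsprop_step(w, grad, avg_sq)
print(w)  # both weights move by almost the same amount
```

Starting from a zero average, the first step has a magnitude of roughly lr / sqrt(1 - decay) for every weight, whatever the scale of its gradient.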
Why it helps
Weights with consistently large gradients get their effective step shrunk, while weights with tiny gradients get a relative boost. This balances progress across dimensions and tames the wild oscillations that a fixed rate would cause in steep directions. It is especially useful on the non-stationary objectives common in recurrent networks.
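The balancing effect can be seen in a small numerical sketch. Here two weights receive constant gradients that differ by a factor of 10,000; the values of lr, decay, and eps, and the gradients themselves, are assumptions picked only to make the contrast visible.

```python
import numpy as np

lr, decay, eps = 0.01, 0.9, 1e-8   # assumed values for illustration
grad = np.array([100.0, 0.01])     # one steep direction, one flat direction
avg_sq = np.zeros_like(grad)

# Feed the same gradient for 50 steps so the moving average settles.
for _ in range(50):
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2

fixed_rate_update = lr * grad                          # plain gradient step
rmsprop_update = lr * grad / (np.sqrt(avg_sq) + eps)   # rescaled step

print(fixed_rate_update)  # steps differ by a factor of 10,000
print(rmsprop_update)     # both steps are close to lr
```

With a fixed rate the steep weight takes a step 10,000 times larger than the flat one, while the rescaled steps both end up near lr.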
Relationship to other methods
RMSProp grew out of fixing AdaGrad, an earlier method that accumulates the sum of all squared gradients. Because that sum only grows, AdaGrad's effective step keeps shrinking and learning can eventually stall. By using a decaying average instead of a full sum, RMSProp keeps adapting throughout training. Adam later combined this per-parameter scaling with momentum into one popular optimizer.
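The difference between a full sum and a decaying average can be sketched with a constant gradient. The full-sum accumulator below stands in for the earlier method (commonly identified as AdaGrad); the variable names and the values of lr, decay, and eps are assumptions for illustration.

```python
import numpy as np

lr, decay, eps = 0.01, 0.9, 1e-8  # assumed values for illustration
grad = 1.0                        # constant gradient, to isolate the accumulator
sum_sq = 0.0                      # full sum of squared gradients (AdaGrad-style)
avg_sq = 0.0                      # decaying average (RMSProp)

for _ in range(1000):
    sum_sq += grad ** 2
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2

full_sum_step = lr * grad / (np.sqrt(sum_sq) + eps)
decaying_avg_step = lr * grad / (np.sqrt(avg_sq) + eps)

print(full_sum_step)      # about lr / sqrt(1000), and still shrinking
print(decaying_avg_step)  # stays close to lr
```

Even though the gradient never changes, the full-sum step keeps falling as training goes on, whereas the decaying average levels off and the step stays usable.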
Key idea
RMSProp scales each weight's step by a decaying average of its squared gradients, balancing progress across dimensions.