The motivation
Plain gradient descent uses one learning rate for every parameter. But some parameters see large, frequent gradients while others see small or infrequent ones, so no single step size suits them all. RMSProp gives each parameter its own adaptive step size.
How it works
RMSProp keeps an exponentially decaying moving average of squared gradients for each parameter:
- The average tracks how large that parameter's gradients have typically been in recent steps
- Each update divides the raw gradient by the square root of this average (plus a small constant to avoid division by zero)
- Parameters with large recent gradients get smaller effective steps, and quiet parameters get larger ones
This keeps updates balanced across very different gradient scales and damps the oscillations that plain gradient descent suffers in steep, narrow valleys.
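The update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name rmsprop_step, the hyperparameter values (lr, decay, eps), and the toy quadratic objective are all assumptions chosen for the example.

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameters and running average."""
    # Exponential moving average of squared gradients, one entry per parameter.
    avg_sq = decay * avg_sq + (1.0 - decay) * grads ** 2
    # Divide each gradient by the root of its own average.
    # eps (an assumed small constant) guards against division by zero.
    params = params - lr * grads / (np.sqrt(avg_sq) + eps)
    return params, avg_sq

# Hypothetical toy objective with very different curvature per coordinate:
# f(w) = 0.5 * (100 * w[0]**2 + 0.01 * w[1]**2), so the gradient is scales * w.
scales = np.array([100.0, 0.01])
w = np.array([1.0, 1.0])
avg_sq = np.zeros_like(w)

for _ in range(5):
    g = scales * w
    w, avg_sq = rmsprop_step(w, g, avg_sq)
```

In this toy setup the two gradients differ by a factor of 10,000, yet on the first step both coordinates move by nearly the same amount, because each gradient is divided by its own running magnitude.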
Relation to others
RMSProp is the per-parameter scaling half of the popular Adam optimizer, which adds momentum and bias correction on top. Because its running average keeps adapting during training, RMSProp tends to handle noisy, nonstationary objectives well, such as those that arise when training recurrent networks.
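To make the relationship concrete, here is a hedged sketch of a single Adam-style step, with the RMSProp-like part marked in the comments. The function name adam_step is hypothetical, and the default values shown are the commonly cited ones rather than anything specified in this text.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update at step t (t starts at 1)."""
    # Momentum part: moving average of the gradients themselves.
    m = beta1 * m + (1.0 - beta1) * grads
    # RMSProp-like part: moving average of the squared gradients.
    v = beta2 * v + (1.0 - beta2) * grads ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Same per-parameter scaling as RMSProp, applied to the momentum term.
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```

Setting beta1 to 0 and skipping the bias correction leaves essentially the RMSProp update, which is one way to see why RMSProp is often described as Adam's scaling component.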
Key idea
RMSProp divides each gradient by a running estimate of its recent magnitude, giving every parameter an adaptive learning rate that stabilizes training.