The RMSProp Optimizer

The motivation

Plain gradient descent uses one learning rate for every parameter. But some parameters see large, frequent gradients while others see tiny ones, so a single step size suits none of them perfectly. RMSProp gives each parameter its own adaptive step.

How it works

RMSProp keeps an exponentially decaying moving average of the squared gradients for each parameter:

It tracks how large that parameter's gradients have typically been.
It then divides the raw gradient by the square root of this average, plus a small constant to avoid division by zero.
Parameters with large recent gradients get smaller effective steps, and parameters with small recent gradients get larger ones.
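
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name rmsprop_step and the default values for lr, decay, and eps are assumptions chosen for the example, not values given in the text.

```python
import math

def rmsprop_step(params, grads, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """Apply one RMSProp update to a list of scalar parameters.

    Returns (new_params, new_avg_sq). All names and defaults are illustrative.
    """
    new_params, new_avg_sq = [], []
    for p, g, v in zip(params, grads, avg_sq):
        # Moving average of this parameter's squared gradients.
        v = decay * v + (1.0 - decay) * g * g
        # Divide the raw gradient by the root of that average (eps avoids 0/0).
        p = p - lr * g / (math.sqrt(v) + eps)
        new_params.append(p)
        new_avg_sq.append(v)
    return new_params, new_avg_sq

# Two parameters whose gradients differ by a factor of 10,000.
params = [1.0, 1.0]
avg_sq = [0.0, 0.0]
grads = [100.0, 0.01]
params, avg_sq = rmsprop_step(params, grads, avg_sq)
steps = [1.0 - p for p in params]
print(steps)  # the two step sizes come out nearly equal
```

Although the raw gradients differ by four orders of magnitude, both parameters move by almost the same amount, because each gradient is divided by its own running magnitude.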

This keeps updates balanced across very different gradient scales and tames the oscillations that plain gradient descent suffers in steep, narrow valleys.
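
To make the narrow-valley behaviour concrete, here is a small experiment under stated assumptions: the test function f(x, y) = 0.5 * (x**2 + 100 * y**2), the starting point (1, 1), the learning rates, and all helper names are made up for illustration. The learning rate for plain descent is deliberately set close to the stability limit of the steep direction.

```python
import math

def grad(x, y):
    # Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2): shallow in x, steep in y.
    return x, 100.0 * y

def run_gd(steps, lr):
    """Plain gradient descent; returns final x, final y, and the history of y."""
    x, y = 1.0, 1.0
    ys = [y]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        ys.append(y)
    return x, y, ys

def run_rmsprop(steps, lr, decay=0.9, eps=1e-8):
    """RMSProp with one running average per coordinate; same return shape."""
    x, y = 1.0, 1.0
    vx = vy = 0.0
    ys = [y]
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = decay * vx + (1.0 - decay) * gx * gx
        vy = decay * vy + (1.0 - decay) * gy * gy
        x -= lr * gx / (math.sqrt(vx) + eps)
        y -= lr * gy / (math.sqrt(vy) + eps)
        ys.append(y)
    return x, y, ys

def sign_flips(values):
    # Count how often consecutive values land on opposite sides of zero.
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)

gd_x, gd_y, gd_ys = run_gd(20, lr=0.019)
rms_x, rms_y, rms_ys = run_rmsprop(20, lr=0.01)
print("plain descent sign flips in y:", sign_flips(gd_ys))
print("RMSProp sign flips in y:", sign_flips(rms_ys))
print("RMSProp x, y:", rms_x, rms_y)
```

In this setup plain descent overshoots the valley floor on every step, so y changes sign each iteration, while under RMSProp y shrinks steadily without crossing zero. The x and y coordinates also follow nearly identical paths under RMSProp, since the 100-fold difference in gradient scale is divided out.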

Relation to others

RMSProp is the per-parameter scaling half of the popular Adam optimizer, which adds momentum on top. Because its scaling adapts as training proceeds, RMSProp tends to cope well with noisy, nonstationary objectives such as those in recurrent networks.
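
The relationship to Adam can be sketched as follows. This is a simplified scalar version for illustration only: the function name adam_step is hypothetical, the defaults are commonly used values rather than anything stated here, and the bias-correction terms belong to the standard Adam formulation but are not discussed in this text.

```python
import math

def adam_step(p, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch).

    t is the 1-based step count; returns (new_p, new_m, new_v).
    """
    # Momentum part: running average of the gradients themselves.
    m = beta1 * m + (1.0 - beta1) * g
    # RMSProp part: running average of the squared gradients.
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction compensates for both averages starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# First step from p = 0 for a very large and a very small gradient.
first_steps = []
for g in (100.0, 0.01):
    p, m, v = adam_step(0.0, g, 0.0, 0.0, t=1)
    first_steps.append(-p)
print(first_steps)  # both are close to lr
```

The line that updates v and the division by its square root are the RMSProp-style scaling; the line that updates m is the momentum added on top.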

Key idea

RMSProp divides each gradient by a running estimate of its recent magnitude, giving every parameter an adaptive learning rate that stabilizes training.

