Machine Learning

The RMSProp Optimizer

Scaling each parameter step by a running estimate of its gradient size.

The motivation

Plain gradient descent uses one learning rate for every parameter. But some parameters see large, frequent gradients while others see tiny ones, so a single step size is too large for some and too small for others. RMSProp gives each parameter its own adaptive step.

How it works

RMSProp keeps an exponentially decaying moving average of recent squared gradients for each parameter:

  • The average tracks how large that parameter's gradients have typically been
  • Each update divides the raw gradient by the square root of this average (plus a small constant to avoid division by zero)
  • Parameters with large recent gradients get smaller effective steps, and quiet parameters get larger ones

This keeps updates balanced across very different scales and tames the wild oscillations that plain gradient descent suffers in steep, narrow valleys.
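
The update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `rmsprop_step`, the default values for `lr`, `decay`, and `eps`, and the toy quadratic objective are all assumptions chosen for the example.

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update; returns the new params and the new running average."""
    # Moving average of squared gradients, kept separately for each parameter.
    avg_sq = decay * avg_sq + (1.0 - decay) * grads ** 2
    # Divide the raw gradient by the root of that average (eps avoids division by zero).
    params = params - lr * grads / (np.sqrt(avg_sq) + eps)
    return params, avg_sq

# Toy objective (an assumption for illustration) with very different scales:
# f(w) = 0.5 * (100 * w[0]**2 + 0.01 * w[1]**2), so the gradient is scales * w.
scales = np.array([100.0, 0.01])
w = np.array([1.0, 1.0])
avg_sq = np.zeros_like(w)

w1, avg_sq1 = rmsprop_step(w, scales * w, avg_sq)
first_step = w - w1  # how far each coordinate moved
```

On this first step both coordinates move by about 0.0316, even though their raw gradients differ by a factor of 10,000. Plain gradient descent with the same learning rate would move the first coordinate 10,000 times farther than the second.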

Relation to others

RMSProp is the per-parameter scaling half of the popular Adam optimizer, which adds momentum (and a bias correction) on top. Because it adapts on the fly, RMSProp tends to cope well with messy, nonstationary objectives like those in recurrent networks.
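
To make the relation to Adam concrete, here is a rough sketch of one Adam step. The line that updates `v` is the same squared-gradient average RMSProp keeps; the line that updates `m` is the added momentum. The function name `adam_step` and the example gradients are hypothetical, and the defaults shown are commonly used values rather than anything this lesson prescribes.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    # Momentum: moving average of the gradients themselves.
    m = beta1 * m + (1.0 - beta1) * grads
    # RMSProp-style moving average of squared gradients.
    v = beta2 * v + (1.0 - beta2) * grads ** 2
    # Bias correction offsets the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

w = np.array([1.0, 1.0])
grads = np.array([100.0, 0.01])  # illustrative gradients on very different scales
w_new, m, v = adam_step(w, grads, np.zeros(2), np.zeros(2), t=1)
adam_first_step = w - w_new
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by roughly `lr` regardless of how large its gradient is.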

Key idea

RMSProp divides each gradient by a running estimate of its recent magnitude, giving every parameter an adaptive learning rate that stabilizes training.

Check yourself

1. What does RMSProp track for each parameter?

2. A parameter with consistently large gradients gets what under RMSProp?