RMSProp

Adapting each weight's step size by its recent gradient scale.

Different parameters often need different step sizes, yet a single global learning rate gives every parameter the same one. RMSProp adapts the step for each parameter using the recent magnitude of its gradients.

The mechanism

  • Maintain a moving average of the squared gradient for each weight.
  • Divide each gradient by the square root of that average before stepping.
  • Add a tiny constant to the denominator to avoid dividing by zero.
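
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name rmsprop_step, the defaults (lr=0.01, decay=0.9, eps=1e-8), and the toy quadratic objective are all chosen here for the example and do not come from the lesson.

```python
import math

def rmsprop_step(weights, grads, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update over lists of floats (hypothetical helper).

    Returns the updated weights and the updated running averages.
    """
    new_weights, new_avg_sq = [], []
    for w, g, s in zip(weights, grads, avg_sq):
        # Step 1: moving average of the squared gradient, one per weight.
        s = decay * s + (1.0 - decay) * g * g
        # Steps 2 and 3: divide by the root of that average,
        # with eps in the denominator to avoid dividing by zero.
        w = w - lr * g / (math.sqrt(s) + eps)
        new_weights.append(w)
        new_avg_sq.append(s)
    return new_weights, new_avg_sq

# Toy objective (an assumption for this sketch):
# f(w) = 0.5 * (100 * w[0]**2 + 1 * w[1]**2), so grad = curvature * w.
curvature = [100.0, 1.0]
weights = [1.0, 1.0]
avg_sq = [0.0, 0.0]

for _ in range(500):
    grads = [c * w for c, w in zip(curvature, weights)]
    weights, avg_sq = rmsprop_step(weights, grads, avg_sq)

print(weights)  # both coordinates should end up close to the minimum at 0
```

Note that the running average starts at zero here, so the first few steps are somewhat larger than lr; real implementations differ in how they handle initialization and where exactly eps is placed.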

Why it helps

Weights with consistently large gradients get their effective step shrunk, while weights with tiny gradients get a relative boost. This balances progress across dimensions and tames the wild oscillations that a fixed rate would cause in steep directions. It is especially useful on the non-stationary objectives common in recurrent networks.
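
To make the balancing effect concrete, the sketch below compares the step a plain fixed-rate update would take with the RMSProp-scaled step for two weights whose gradients differ by four orders of magnitude. The gradient values, lr, decay, and eps are illustrative assumptions, and the gradients are held constant purely to keep the arithmetic easy to follow.

```python
import math

lr, decay, eps = 0.01, 0.9, 1e-8   # illustrative values, not from the lesson
grads = [100.0, 0.01]              # one steep direction, one flat direction
avg_sq = [0.0, 0.0]

# Let the moving average of squared gradients settle.
for _ in range(200):
    avg_sq = [decay * s + (1.0 - decay) * g * g for s, g in zip(avg_sq, grads)]

# Step size under a single global learning rate.
fixed_steps = [lr * g for g in grads]
# Step size after dividing by the root of the running average.
rmsprop_steps = [lr * g / (math.sqrt(s) + eps) for g, s in zip(grads, avg_sq)]

print("fixed rate:", fixed_steps)
print("RMSProp:   ", rmsprop_steps)
```

With the fixed rate, the two steps differ by the same factor as the gradients. After scaling, both steps land close to lr, which is the sense in which large-gradient weights are shrunk and small-gradient weights are boosted.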

Relationship to other methods

RMSProp grew out of fixing an earlier method, AdaGrad, which let the sum of squared gradients grow forever and so shrank the step until learning stalled. By using a decaying average instead of a full sum, RMSProp keeps adapting throughout training. Adam later combined this per-parameter scaling with momentum into one popular optimizer.
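
The difference between a full sum and a decaying average can be seen with a single weight and a constant gradient. The sketch below is an illustration under assumptions: the sum-based accumulator stands in for the earlier method (usually identified as AdaGrad), and the values of lr, decay, eps, and the gradient are chosen only for the example.

```python
import math

lr, decay, eps = 0.01, 0.9, 1e-8   # illustrative values, not from the lesson
grad = 1.0                         # constant gradient, for simplicity

sum_sq = 0.0   # full sum of squared gradients (earlier, AdaGrad-style method)
avg_sq = 0.0   # decaying average of squared gradients (RMSProp)
sum_steps, avg_steps = [], []

for _ in range(1000):
    sum_sq += grad * grad
    avg_sq = decay * avg_sq + (1.0 - decay) * grad * grad
    sum_steps.append(lr * grad / (math.sqrt(sum_sq) + eps))
    avg_steps.append(lr * grad / (math.sqrt(avg_sq) + eps))

print("full sum, first and last step:        ", sum_steps[0], sum_steps[-1])
print("decaying average, first and last step:", avg_steps[0], avg_steps[-1])
```

The sum-based step keeps shrinking as the iteration count grows, even though the gradient never changes, while the step based on the decaying average settles near lr and stays there.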

Key idea

RMSProp scales each weight's step by a decaying average of its squared gradients, balancing progress across dimensions.

Check yourself

1. What does RMSProp track for each weight?

2. What happens to weights with consistently large gradients under RMSProp?