SARSA

An on-policy cousin of Q-learning that learns the value of the policy it follows.

SARSA is an on-policy temporal-difference (TD) control algorithm. Its name comes from the tuple each update uses: state, action, reward, next state, next action (S, A, R, S', A').

The update

Like Q-learning, SARSA keeps action-value estimates Q(s, a) and updates one of them at every step. The difference is the target. SARSA uses the value of the action the policy actually takes next, not the maximum over next actions:

  • The target is the reward plus the discounted Q value of the next state and the action chosen there: r + γ Q(s', a').
  • The next action a' comes from the same exploratory policy the agent follows.
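
The two points above can be sketched as a single update function. This is a minimal illustration, not a reference implementation: the name sarsa_update, the dictionary-based table, the step size alpha, and the example states and numbers are all assumptions made for the example.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """Apply one SARSA update to Q in place and return the TD error."""
    # Target: reward plus the discounted value of the next action actually chosen.
    # A terminal next state contributes no future value.
    target = r if done else r + gamma * Q[(s_next, a_next)]
    td_error = target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical values: in state "s1", "right" looks much better than "left".
Q = defaultdict(float)
Q[("s1", "left")] = 2.0
Q[("s1", "right")] = 10.0

# The agent explored and will take "left" in s1, so SARSA uses Q(s1, left) = 2.0.
sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="left",
             alpha=0.5, gamma=0.5)
print(Q[("s0", "right")])
```

With these made-up numbers the target is 1.0 + 0.5 × 2.0 = 2.0, so the entry moves from 0.0 to 1.0. A max-based target would have used 10.0 in place of 2.0.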

On-policy

Because the update uses the action the agent will really take, SARSA is on-policy: it learns the value of the policy it is currently following, including its exploration. Q-learning, by contrast, is off-policy: its target assumes the greedy action is taken next, whatever the agent actually does.
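
One way to see the on-policy property is in the shape of the training loop: the next action is chosen first, used in the target, and then actually executed. The sketch below assumes a made-up five-state corridor (states 0 to 4, goal at 4, reward -1 per move) and an epsilon-greedy policy. The environment, the function names, and the parameter values are all hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)   # hypothetical corridor: step left or step right
GOAL = 4             # states are 0..4; reaching 4 ends the episode

def step(state, action):
    """Toy environment: -1 reward per move, walls clamp the position."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, -1.0, next_state == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    """The single policy used both to act and to form the update target."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def run_sarsa(episodes=200, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            # Choose the next action BEFORE updating, with the same policy.
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            # The action used in the target is the one executed on the next step.
            state, action = next_state, next_action
    return Q

Q = run_sarsa()
```

In a Q-learning loop the target would take the maximum over next actions, and the action executed next could differ from the one the target assumed. Here they are the same action by construction.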

A practical difference

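This makes SARSA more cautious near danger. If exploration occasionally takes a risky action, SARSA's values include that cost, so it learns a safer path. Q-learning learns the path that is optimal under greedy behavior, even when exploration makes that path costly during learning. In the classic cliff-walking example, SARSA settles on a route farther from the cliff edge, while Q-learning's route runs right along it.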
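
A small worked calculation may make the effect concrete. Suppose the next state sits beside a cliff and has the invented action values below. SARSA samples one next action per step, so its targets average out to the epsilon-greedy expectation computed here (the quantity the Expected SARSA variant uses directly), while Q-learning always uses the maximum. All names and numbers are assumptions for illustration.

```python
def expected_next_value(q_next, epsilon):
    """Average of Q(s', a') when a' is drawn epsilon-greedily from q_next."""
    n = len(q_next)
    greedy = max(q_next, key=q_next.get)
    total = 0.0
    for action, value in q_next.items():
        # Every action gets epsilon / n; the greedy one gets the remaining mass.
        prob = epsilon / n + (1.0 - epsilon if action == greedy else 0.0)
        total += prob * value
    return total

# Hypothetical action values at a state next to the cliff.
q_next = {"right": -5.0, "up": -7.0, "left": -9.0, "down": -100.0}
reward, gamma, epsilon = -1.0, 1.0, 0.1

q_learning_target = reward + gamma * max(q_next.values())
avg_sarsa_target = reward + gamma * expected_next_value(q_next, epsilon)
print(q_learning_target, avg_sarsa_target)
```

Here the Q-learning target is -6.0, while the average SARSA target is about -8.525: the 2.5% chance of stepping off the cliff drags the value of standing next to it down, which is what pushes SARSA toward the safer route.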

Convergence

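If exploration decays so that the policy becomes greedy in the limit, while every state-action pair keeps being visited and the usual step-size conditions hold, SARSA also converges to the optimal policy. Along the way, its estimates describe the behavior it actually performs.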
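
As a rough sketch of what decaying exploration can look like, one commonly cited schedule sets epsilon to 1/k in episode k. The function name and the choice of schedule are assumptions for illustration; in practice the decay rate is usually tuned.

```python
def epsilon_for_episode(k):
    """Hypothetical schedule: epsilon = 1/k for episode k = 1, 2, 3, ..."""
    if k < 1:
        raise ValueError("episode index starts at 1")
    return 1.0 / k

# Fully random in the first episode, then progressively greedier.
schedule = [epsilon_for_episode(k) for k in (1, 2, 4, 100)]
print(schedule)  # [1.0, 0.5, 0.25, 0.01]
```

Epsilon never reaches exactly zero under this schedule, so every action keeps a nonzero chance of being tried while the policy approaches the greedy one.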

Key idea

SARSA is an on-policy TD method that updates toward the value of the next action actually taken, so it learns the value of the policy it follows, exploration included.

Check yourself

1. What does the second A in SARSA refer to?

2. Why is SARSA called on policy?