SARSA

SARSA is an on-policy temporal-difference (TD) control algorithm. Its name comes from the tuple used in each update: state, action, reward, next state, next action (S, A, R, S', A').

The update

Like Q-learning, SARSA keeps a table of action values and updates one entry per step. The difference is the target. SARSA bootstraps from the value of the action the policy actually takes next, not from the maximum over next actions:

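Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]

The target is R + γ Q(S', A'): the reward plus the discounted value of the next state and the action chosen there. Here α is the step size and γ is the discount factor.
The next action A' comes from the same exploratory policy the agent follows, for example ε-greedy with respect to Q.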
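
As a minimal sketch of the update described above, the function below applies one SARSA step to a tabular Q stored in a dictionary. The name sarsa_update, the (state, action) key layout, the done flag, and the default alpha and gamma are illustrative assumptions, not part of any particular library.

```python
from collections import defaultdict


def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9, done=False):
    """Apply one SARSA step to the table Q in place and return the TD error.

    Q maps (state, action) pairs to floats (an assumed layout).
    """
    # A terminal transition has no next action, so nothing is bootstrapped.
    bootstrap = 0.0 if done else Q[(next_state, next_action)]
    # SARSA target: reward plus the discounted value of the action actually chosen next.
    target = reward + gamma * bootstrap
    td_error = target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical one-step example: the agent moved s0 -> s1 and has already chosen "right" in s1.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
td_error = sarsa_update(Q, "s0", "right", 1.0, "s1", "right")
print(Q[("s0", "right")])
```

With these made-up numbers the target is 1.0 + 0.9 * 2.0, so the entry for ("s0", "right") moves halfway from 0 toward that target. Note that the next action has to be selected before the update can run, which is what distinguishes this loop from Q-learning.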

On-policy

Because the update uses the action the agent will really take, SARSA is on-policy: it learns the value of the policy it is currently following, exploration included. Q-learning, by contrast, is off-policy: its target assumes the greedy action is taken next, whatever the agent actually does.
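
To make the contrast concrete, here is a small sketch that computes both targets for the same transition. The nested-dictionary Q layout, the function names, and the numbers are assumptions chosen for illustration.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    """On-policy target: bootstrap from the action actually chosen next."""
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma=0.9):
    """Off-policy target: bootstrap from the best next action, taken or not."""
    return reward + gamma * max(Q[next_state].values())


# Hypothetical values: in s1, "right" looks much better than "left".
Q = {"s1": {"left": 0.0, "right": 10.0}}

# Suppose exploration picked "left" in s1 on this step.
on_policy = sarsa_target(Q, reward=1.0, next_state="s1", next_action="left")
off_policy = q_learning_target(Q, reward=1.0, next_state="s1")
print(on_policy, off_policy)
```

When the chosen next action happens to be the greedy one, the two targets coincide. They differ only on steps where the policy explores, which is exactly where SARSA records the cost of exploration.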

A practical difference

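This makes SARSA more cautious near danger. If exploration occasionally takes a risky action, SARSA's estimates include that cost, so it learns a safer path. Q-learning learns the values of the optimal path as if behavior were greedy, even while the agent is still exploring. In the classic cliff-walking example, SARSA settles on a route further from the cliff edge, while Q-learning learns the shortest route along the edge.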
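
A rough way to see why is to compare the greedy value of a state next to the cliff with the average value under an exploring policy. The sketch below does this arithmetic for an ε-greedy policy. The action names and action values are invented for illustration, and the average is what SARSA's sampled targets tend toward over many visits, not a single update.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Probability of each action under epsilon-greedy (uniform exploration)."""
    n = len(q_values)
    best = max(q_values, key=q_values.get)
    return {a: epsilon / n + (1.0 - epsilon if a == best else 0.0)
            for a in q_values}


def expected_next_value(q_values, epsilon):
    """Average action value when actions are drawn epsilon-greedily."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return sum(probs[a] * q_values[a] for a in q_values)


# Hypothetical action values in a state on the cliff edge.
edge_state = {"forward": -10.0, "into_cliff": -100.0, "away": -12.0, "back": -12.0}

greedy_value = max(edge_state.values())                 # what Q-learning bootstraps from
on_policy_value = expected_next_value(edge_state, 0.1)  # what SARSA sees on average
print(greedy_value, on_policy_value)
```

With ε = 0.1 and four actions, the step into the cliff is taken 2.5% of the time, which pulls the on-policy average down to about -12.35 against a greedy value of -10. A state further from the edge, where no action is catastrophic, does not pay this penalty, so it can look better to SARSA even if the route through it is longer.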

Convergence

If exploration decays so that the policy becomes greedy in the limit, while every state-action pair keeps being visited and the step sizes decrease suitably, SARSA also converges to the optimal policy. Along the way, it values the behavior it actually performs.
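
The following is a sketch of a full SARSA loop with decaying exploration, under several assumptions: a hypothetical five-state corridor with a reward of -1 per move, ε set to 1/(episode + 1), and a fixed step size for simplicity (so it illustrates the idea and does not satisfy the formal step-size conditions). All names here are made up for the example.

```python
import random

ACTIONS = ("left", "right")
GOAL = 4  # corridor states 0..4; state 4 is terminal


def step(state, action):
    """Hypothetical corridor dynamics: -1 per move, the left wall clamps position."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    return nxt, -1.0, nxt == GOAL


def choose(Q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def train(episodes=500, alpha=0.5, gamma=1.0, seed=0, max_steps=10_000):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for episode in range(episodes):
        epsilon = 1.0 / (episode + 1)  # exploration decays toward greedy
        state = 0
        action = choose(Q, state, epsilon, rng)
        for _ in range(max_steps):  # safety cap on episode length
            nxt, reward, done = step(state, action)
            # Choose the next action first: the update needs it.
            nxt_action = choose(Q, nxt, epsilon, rng)
            bootstrap = 0.0 if done else Q[(nxt, nxt_action)]
            Q[(state, action)] += alpha * (reward + gamma * bootstrap - Q[(state, action)])
            if done:
                break
            state, action = nxt, nxt_action
    return Q


Q = train()
print(Q[(3, "right")])
```

The action chosen for the update is the same one executed on the next step, so the table tracks the ε-greedy policy being followed. As ε shrinks, that policy approaches the greedy one and the estimates approach the optimal values; in this corridor the entry for moving right from state 3 should settle near -1.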

Key idea

SARSA is an on-policy TD method that updates toward the value of the next action actually taken, so it learns the value of the policy it follows, exploration included.
