Q-Learning

An off-policy method that learns the optimal action values directly.

Q-learning is a temporal-difference (TD) control algorithm that learns the optimal action-value function directly, regardless of the policy the agent follows while collecting experience.

The update

Q-learning keeps a table (or a function approximator) of Q-values, each estimating the expected discounted return for one state-action pair. After each step it moves the Q-value of the pair just visited toward a target built from the best next action:

  • The target is the reward plus the discounted maximum Q-value over the actions available in the next state.
  • The Q-value moves toward this target by a small step whose size is set by the learning rate.

Taking the maximum is what makes the method learn optimal values rather than the value of the current policy.
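
The update described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the nested-list table `q[s][a]`, the function name `q_update`, the `done` flag for terminal transitions, and the default values of `alpha` (learning rate) and `gamma` (discount) are all assumptions made for the example.

```python
def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place; return the TD error."""
    # A terminal transition has no next state to bootstrap from.
    best_next = 0.0 if done else max(q[next_state])
    # Target: reward plus the discounted maximum Q-value over next actions.
    target = reward + gamma * best_next
    td_error = target - q[state][action]
    # Move the current estimate a fraction alpha of the way toward the target.
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical 3-state, 2-action table; q[s][a] holds the current estimate.
q = [[0.0, 0.0], [0.5, 2.0], [0.0, 0.0]]
q_update(q, state=0, action=1, reward=1.0, next_state=1, done=False,
         alpha=0.5, gamma=0.9)
# The target is 1.0 + 0.9 * 2.0 = 2.8, so the entry moves halfway from 0 to 2.8.
print(round(q[0][1], 6))
```

Note that the update uses `max(q[next_state])` and never asks which action the agent will actually take in the next state.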

Off-policy

Q-learning is off-policy: it learns about the greedy policy (the target policy) while following a different, more exploratory behavior policy, such as epsilon-greedy. The agent can wander and still converge to the optimal Q-values, because the max in the update always evaluates the best-valued next action rather than the action the agent actually takes next.
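
To make the off-policy property concrete, here is a small sketch in which the agent behaves uniformly at random (epsilon = 1.0) yet the learned values still favor the optimal action everywhere. The five-state chain environment, the helper names (`step`, `behavior_action`, `train`), and all hyperparameters are hypothetical choices for illustration only.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical chain: states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain dynamics: reward 1.0 only on reaching the goal."""
    nxt = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def behavior_action(q, state, epsilon, rng):
    """Epsilon-greedy behavior policy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    best = max(q[state])
    return rng.choice([a for a in (LEFT, RIGHT) if q[state][a] == best])


def train(epsilon, episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = behavior_action(q, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            # The target uses the greedy value of the next state,
            # not the action the behavior policy will take there.
            best_next = 0.0 if done else max(q[nxt])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = nxt
    return q


# epsilon=1.0: the agent acts uniformly at random for the whole run.
q = train(epsilon=1.0)
greedy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)  # greedy action per non-terminal state (1 means RIGHT)
```

Because this toy environment is deterministic, a constant learning rate is enough here; in a noisy environment the learning rate would need to shrink over time.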

Convergence

Provided every state-action pair keeps being visited (infinitely often in the limit) and the learning rate decays appropriately, tabular Q-learning provably converges to the optimal Q-function. The optimal policy then follows by acting greedily with respect to those values.
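
One common way to satisfy the decaying-learning-rate condition is a per-pair step size of 1/n, where n counts visits to that state-action pair. The sketch below shows such a schedule together with greedy policy extraction. The class name `DecayingStepSize`, the function `greedy_policy`, and the example numbers are assumptions for illustration; other schedules also work.

```python
from collections import defaultdict


class DecayingStepSize:
    """Per-pair step size: alpha = 1 / (visits to that state-action pair)."""

    def __init__(self):
        self.visits = defaultdict(int)

    def __call__(self, state, action):
        self.visits[(state, action)] += 1
        return 1.0 / self.visits[(state, action)]


def greedy_policy(q):
    """Read off the greedy action for every state from a table q[s][a]."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q]


alpha = DecayingStepSize()
# With a 1/n step size, repeated updates toward noisy targets
# compute the running mean of those targets.
estimate = 0.0
for target in [2.0, 4.0, 6.0]:
    estimate += alpha("s0", "a0") * (target - estimate)
print(estimate)  # the mean of the three targets, up to floating-point rounding

# Hypothetical converged table with two states and two actions.
print(greedy_policy([[0.1, 0.9], [0.7, 0.2]]))
```

The averaging behavior is why a shrinking step size matters when rewards or transitions are random: a fixed step size keeps reacting to the latest sample and never settles.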

Practical notes

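Q-learning underlies deep RL methods such as DQN, in which the table is replaced by a neural network. Because the same noisy estimates are used both to pick the maximizing action and to evaluate it, the max operator can cause systematic overestimation, which later variants such as Double Q-learning address.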
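
The overestimation effect can be seen without any environment at all. In the sketch below every action has a true value of 0, but the estimates carry zero-mean noise; the average of the maximum estimate still comes out clearly positive. The function name, the Gaussian noise model, and the parameter values are assumptions chosen for illustration.

```python
import random


def mean_max_estimate(n_actions=5, noise=1.0, trials=10_000, seed=0):
    """Average of max_a Q_hat(a) when every true value is 0 and estimates are noisy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0.0) plus zero-mean Gaussian noise.
        estimates = [rng.gauss(0.0, noise) for _ in range(n_actions)]
        total += max(estimates)
    return total / trials


bias = mean_max_estimate()
print(f"true max value: 0.00, average estimated max: {bias:.2f}")
```

With a single action there is nothing to maximize over and the bias disappears, which is one way to see that the max, not the noise alone, is responsible.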

Key idea

Q-learning is an off-policy TD method that updates action values toward the reward plus the discounted maximum next Q-value, converging to the optimal action-value function.

Check yourself

1. What makes Q-learning off-policy?

2. Which operation does the Q-learning target apply over next actions?

3. What problem can the max operator cause?