Q-Learning

Q-learning is a temporal-difference (TD) control algorithm that learns the optimal action-value function directly, regardless of the policy the agent follows while collecting experience.

The update

Q-learning maintains a table or function approximator of Q-values, where each Q-value estimates the expected return for a state-action pair. After each step it moves the value of the visited pair toward a target built from the best next action:

The target is the reward plus the discounted maximum Q-value over the actions available in the next state.
The Q-value moves toward this target by a small step, scaled by the learning rate.
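
In standard notation, these two statements are usually written as the update below. The symbols are assumptions not defined in the text: s and a are the current state and action, r is the reward, s' is the next state, \alpha is the learning rate, and \gamma is the discount factor.

```latex
\begin{aligned}
y &= r + \gamma \max_{a'} Q(s', a') \\
Q(s, a) &\leftarrow Q(s, a) + \alpha \bigl( y - Q(s, a) \bigr)
\end{aligned}
```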

Taking the maximum is what makes the method learn optimal values rather than the value of the current policy.
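
As a minimal sketch of a single update step, assuming a tabular setting: the function name q_update, the dictionary layout keyed by (state, action), and the state and action labels are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    # Target: reward plus the discounted maximum Q-value over next actions.
    target = reward + gamma * best_next
    td_error = target - Q[(state, action)]
    # Move the current estimate a fraction alpha of the way toward the target.
    Q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical two-action example with hand-picked values.
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0
td = q_update(Q, "s0", "right", reward=1.0, next_state="s1",
              actions=["left", "right"], alpha=0.5, gamma=0.9)
```

Here the target is 1.0 + 0.9 * 2.0 = 2.8, so the estimate for ("s0", "right") moves from 0.0 halfway to 2.8. Note that the update uses the larger next value (2.0) no matter which action the agent goes on to take in s1.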

Off-policy

Q-learning is off-policy: it learns about the greedy (optimal) policy while following a different, more exploratory behavior policy. The agent can wander and still converge to optimal Q-values, because the max in the update always evaluates the next state by its best currently estimated action, regardless of which action the agent actually takes next.
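
The split between the two policies can be sketched as follows. Epsilon-greedy is assumed here as the exploratory behavior policy (the text does not name one), and the function names and Q-values are hypothetical.

```python
import random


def greedy_action(Q, state, actions):
    """Target policy: the action with the highest current Q-value estimate."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


def epsilon_greedy_action(Q, state, actions, epsilon, rng):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action(Q, state, actions)


rng = random.Random(0)
actions = ["left", "right"]
Q = {("s", "left"): 0.2, ("s", "right"): 1.0}

# What the agent actually does: a mix of greedy and random actions.
behavior_samples = [epsilon_greedy_action(Q, "s", actions, 0.5, rng)
                    for _ in range(1000)]
# What the update learns about: always the greedy action.
target_action = greedy_action(Q, "s", actions)
```

The behavior samples contain both actions, while the action used inside the max of the update is always the greedy one.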

Convergence

If every state-action pair is visited infinitely often and the learning rate decays appropriately (the step sizes sum to infinity while their squares sum to a finite value), tabular Q-learning provably converges to the optimal Q-function. The optimal policy is then obtained by acting greedily with respect to those values.
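
A small end-to-end sketch of these conditions, under stated assumptions: the four-state chain environment, the uniformly random behavior policy, and the per-pair step size (visit count raised to the power -0.6) are all hypothetical choices for illustration, not part of the text.

```python
import random

N_STATES = 4       # states 0..3; state 3 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)   # 0 = left, 1 = right
GAMMA = 0.9


def step(state, action):
    """Deterministic chain: reward 1.0 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    visits = [[0, 0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Uniformly random behavior keeps every pair being visited.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Decaying per-pair learning rate.
            visits[state][action] += 1
            alpha = visits[state][action] ** -0.6
            best_next = 0.0 if done else max(Q[next_state])
            target = reward + GAMMA * best_next
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = train()
# Greedy policy for the non-terminal states, read off the learned values.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

Although the agent never acts greedily during training, the greedy policy read off the learned table moves right in every state, which is optimal for this chain.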

Practical notes

Q-learning underlies value-based deep RL methods such as DQN, where the table is replaced by a neural network. The max operator can cause overestimation of action values, which later variants such as Double Q-learning address.
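
The overestimation effect can be illustrated without any environment. In this sketch, which is an assumption-laden toy rather than a full algorithm, every action has a true value of 0.0 and the estimates carry independent Gaussian noise. The second estimator shows the idea behind double estimators: pick the action with one set of estimates and evaluate it with an independent set.

```python
import random
import statistics

rng = random.Random(0)
n_actions, n_trials = 10, 5000
true_values = [0.0] * n_actions  # every action is equally good

single_estimates = []
double_estimates = []
for _ in range(n_trials):
    # Two independent noisy estimates of the same true values.
    noisy_a = [v + rng.gauss(0.0, 1.0) for v in true_values]
    noisy_b = [v + rng.gauss(0.0, 1.0) for v in true_values]
    # Single estimator: select and evaluate with the same noisy values.
    single_estimates.append(max(noisy_a))
    # Double estimator: select with A, evaluate with B.
    best = max(range(n_actions), key=lambda i: noisy_a[i])
    double_estimates.append(noisy_b[best])

# Bias relative to the true maximum value, which is 0.0.
single_bias = statistics.mean(single_estimates) - max(true_values)
double_bias = statistics.mean(double_estimates) - max(true_values)
```

The single estimator is biased clearly upward because the max picks out whichever estimate happened to receive the largest noise, while the double estimator stays close to zero in this setup.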

Key idea

Q-learning is an off-policy TD method that updates action values toward the reward plus the discounted maximum next Q-value, converging to the optimal action-value function.
