Q-Learning
Q-learning is a temporal-difference (TD) control algorithm that learns the optimal action-value function directly, regardless of the policy the agent uses to choose its actions.
The update
Q-learning keeps a table (or a function approximator) of Q-values, each an estimate of the expected return for a state-action pair. After each step it updates the Q-value of the pair it just visited toward a target built from the best next action:
- The target is the reward plus the discounted maximum Q-value over the actions available in the next state.
- The Q-value moves toward this target by a fraction set by the learning rate.
Taking the maximum is what makes the method learn optimal values rather than the value of the current policy.
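The two steps above can be sketched for the tabular case as follows. This is a minimal illustration, not a reference implementation: the function name `q_update`, the dictionary-based table, the `done` flag for terminal transitions, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # A terminal transition has no future value to bootstrap from.
    if done or not next_actions:
        best_next = 0.0
    else:
        best_next = max(Q[(next_state, a)] for a in next_actions)
    # Target: reward plus the discounted maximum Q over next actions.
    target = reward + gamma * best_next
    # Move the current estimate a learning-rate-sized step toward the target.
    td_error = target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical states and actions, chosen only to make the arithmetic visible.
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0
q_update(Q, "s0", "right", reward=1.0, next_state="s1",
         next_actions=["left", "right"], alpha=0.5, gamma=0.9)
```

Here the target is 1.0 + 0.9 * 2.0 = 2.8, so the entry for ("s0", "right") moves halfway from 0 to 2.8.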
Off-policy
Q-learning is off-policy: it learns about the greedy (optimal) policy while following a different, more exploratory policy. The agent can wander and still converge to the optimal Q-values, because the max in the update always evaluates the best next action under the current estimates, not the action the agent actually takes next.
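To make the split between the two policies concrete, the sketch below pairs an exploratory behaviour policy with the greedy policy that the update implicitly evaluates. Epsilon-greedy is only one common choice of behaviour policy and is an assumption here, as are the function names and the toy table.

```python
import random


def greedy_action(Q, state, actions):
    """Target policy: the action with the highest current Q estimate."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


def epsilon_greedy_action(Q, state, actions, epsilon, rng=random):
    """Behaviour policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action(Q, state, actions)


# Hypothetical table with a single state and two actions.
Q = {("s", "a"): 1.0, ("s", "b"): 0.0}
rng = random.Random(0)
behaviour_samples = [epsilon_greedy_action(Q, "s", ["a", "b"], 1.0, rng)
                     for _ in range(200)]
target_choice = greedy_action(Q, "s", ["a", "b"])
```

Even with epsilon set to 1.0, so that the behaviour is fully random, the greedy choice used inside the update target is still the action with the highest estimate.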
Convergence
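Provided every state-action pair keeps being visited (infinitely often in the limit) and the learning rate decays appropriately (the step sizes sum to infinity while their squares sum to a finite value), tabular Q-learning provably converges to the optimal Q-function. The optimal policy is then obtained by acting greedily with respect to those values.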
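A small experiment can illustrate this. The sketch below runs tabular Q-learning on a hypothetical five-state chain, where moving right from the second-to-last state reaches the terminal state and pays a reward of 1. The environment, the fully random behaviour policy, the per-pair step size of n to the power -0.6, and the episode count are all assumptions chosen for the example, not part of the algorithm's definition.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right
GAMMA = 0.9


def step(state, action):
    """Hypothetical chain environment: reward 1 on reaching the last state."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=3000, epsilon=1.0, seed=0, max_steps=200):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    visits = [[0, 0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Behaviour policy: epsilon-greedy (fully random when epsilon is 1).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[state][a])
            next_state, reward, done = step(state, action)
            # Learning rate decays with the visit count of this pair.
            visits[state][action] += 1
            alpha = visits[state][action] ** -0.6
            # Target uses the max over next actions, not the action taken next.
            bootstrap = 0.0 if done else GAMMA * max(Q[next_state])
            Q[state][action] += alpha * (reward + bootstrap - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = train()
# Greedy policy read off the learned values for the non-terminal states.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

In this chain the optimal value of moving right from state 0 is 0.9 cubed (0.729), since the reward arrives three steps later, so `Q[0][1]` can be compared against that figure. The greedy policy is recovered even though the agent never acts greedily during training.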
Practical notes
Q-learning underlies value-based deep RL, where the table is replaced by a neural network (as in DQN). Because the max is taken over noisy estimates, it tends to overestimate action values, which later variants such as Double Q-learning address.
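The overestimation effect can be seen without any environment at all. In the sketch below every action has a true value of 0, and each estimate is that value plus Gaussian noise. The function name, noise model, and parameter values are assumptions made for the illustration.

```python
import random


def max_bias(n_actions=10, noise_std=1.0, trials=5000, seed=0):
    """Average of the max over noisy estimates when every true value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / trials


bias_many_actions = max_bias(n_actions=10)
bias_single_action = max_bias(n_actions=1)
```

With a single action the average stays near 0, while with several actions the average of the maximum is clearly positive even though no action is actually better than another. The same upward bias enters the Q-learning target whenever the next-state estimates are noisy.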
Key idea
Q-learning is an off-policy TD method that updates action values toward the reward plus the discounted maximum next Q-value, converging to the optimal action-value function given sufficient exploration and a suitably decaying learning rate.