Off-policy control
Q-learning estimates the optimal action-value function directly. Its update target is the reward plus the discounted maximum Q-value over next actions, regardless of which action the agent actually takes next. That max is why Q-learning is off-policy: it learns about the greedy policy while following a more exploratory behavior policy.
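Written out, the target described above looks like the following. The symbols are assumptions chosen for illustration (the notes themselves use no notation): r_{t+1} is the observed reward, s_{t+1} the next state, and \gamma the discount factor.

```latex
y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')
```

The maximum ranges over every action a' available in the next state, so the target does not depend on the action the behavior policy goes on to choose.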
The update
Each step nudges the current action value toward the target:
- Observe state, action, reward, and next state.
- Build the target as reward plus gamma times the best next action value.
- Move the estimate toward that target by a fraction set by the learning rate.
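The three steps above can be sketched as a single function. This is a minimal illustration rather than a reference implementation: the name `q_learning_update`, the dictionary keyed by (state, action), the default values of `alpha` and `gamma`, and the `terminal` flag are all assumptions made for the example.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning step; Q maps (state, action) to a value."""
    # Best next action value. Assumption: it is treated as zero when the
    # episode has ended or the next state offers no actions.
    if terminal or not next_actions:
        best_next = 0.0
    else:
        best_next = max(Q[(next_state, a)] for a in next_actions)

    # Target: reward plus gamma times the best next action value.
    target = reward + gamma * best_next

    # Move the current estimate a fraction alpha of the way toward the target.
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


# Hypothetical two-state example; unseen pairs start at 0.0.
Q = defaultdict(float)
new_value = q_learning_update(Q, "s0", "right", 1.0, "s1", ["left", "right"],
                              alpha=0.5, gamma=0.9)
```

Note that the function never receives the action the agent actually takes in `next_state`; only the maximum over `next_actions` enters the target.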
Convergence requirements
Tabular Q-learning converges to the optimal action values with probability one under the classic conditions:
- Every state-action pair is visited infinitely often.
- The learning rates satisfy the Robbins-Monro conditions: their sum diverges, but the sum of their squares is finite.
- Rewards are bounded.
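One common way to meet the learning-rate condition is to decay the rate with a per-pair visit count. The sketch below assumes a 1/n schedule, where n is the number of visits to a pair so far; the names `visit_counts` and `learning_rate` are hypothetical, and other schedules also satisfy the conditions.

```python
from collections import defaultdict

# Number of times each (state, action) pair has been updated so far.
visit_counts = defaultdict(int)


def learning_rate(state, action):
    """Return 1/n on the n-th visit to (state, action)."""
    visit_counts[(state, action)] += 1
    return 1.0 / visit_counts[(state, action)]


# Rates seen by one pair over its first 1000 visits: 1, 1/2, 1/3, ...
rates = [learning_rate("s0", "right") for _ in range(1000)]

# The partial sums of the rates keep growing (harmonic series), while the
# partial sums of the squared rates stay below pi**2 / 6.
partial_sum = sum(rates)
partial_sum_squares = sum(r * r for r in rates)
```

A constant learning rate, by contrast, fails the second condition, because the sum of its squares grows without bound.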
The infinite visitation requirement is why exploration matters; a purely greedy agent might never sample the pairs it needs.
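Epsilon-greedy action selection is one simple way to keep exploring. The sketch below is an illustration under stated assumptions: the function name, the `epsilon` default, and the dictionary layout of `Q` are not taken from these notes.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        # Explore: every action has a positive chance of being chosen.
        return rng.choice(actions)
    # Exploit: highest current estimate; unseen pairs default to 0.0.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


# Hypothetical usage with a seeded generator for repeatability.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
rng = random.Random(0)
greedy_action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0, rng=rng)
explored_action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=1.0, rng=rng)
```

With any epsilon above zero, each action in a visited state is selected with positive probability, which is what lets the agent keep sampling the pairs that a purely greedy agent would skip.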
Key idea
Tabular Q-learning converges to optimal action values when every state-action pair is visited infinitely often and learning rates meet the Robbins-Monro conditions, which is why exploration is essential.