Machine Learning

The Q-Learning Convergence Conditions

When the classic off-policy control algorithm provably finds the optimal action values.

Off-policy control

Q-learning estimates the optimal action-value function directly. Its update target is the reward plus the discounted maximum Q-value over the actions available in the next state, regardless of which action the agent actually takes next. That max is why Q-learning is off-policy: the target evaluates the greedy policy, while the agent is free to follow a different, more exploratory behavior policy when it acts.

The update

Each step nudges the current action value toward the target:

  • Observe state, action, reward, and next state.
  • Build the target as reward plus gamma times the best next action value.
  • Move the estimate toward that target by a fraction set by the learning rate.
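The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `q_learning_update`, the list-of-lists table layout, and the `terminal` flag (which zeroes the bootstrap term at the end of an episode) are assumptions made for the example.

```python
def q_learning_update(Q, state, action, reward, next_state, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update in place and return the target used."""
    # Best next action value; assumed to be zero once the episode has ended.
    best_next = 0.0 if terminal else max(Q[next_state])
    # Target: reward plus gamma times the best next action value.
    target = reward + gamma * best_next
    # Move the estimate a fraction alpha of the way toward the target.
    Q[state][action] += alpha * (target - Q[state][action])
    return target


# Hypothetical two-state, two-action table: Q[state][action].
Q = [[0.0, 0.0], [1.0, 3.0]]
target = q_learning_update(Q, state=0, action=1, reward=1.0, next_state=1,
                           alpha=0.5, gamma=0.9)
print("target:", target, "updated Q[0][1]:", Q[0][1])
```

Note that the target never looks at the action the agent goes on to take in `next_state`; it only uses the maximum over that row of the table.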

Convergence requirements

Tabular Q-learning converges to the optimal action values with probability one under the classic conditions:

  • Every state-action pair is visited infinitely often.
  • The learning rates satisfy the Robbins-Monro conditions: for each state-action pair, the rates sum to infinity while their squares sum to a finite value.
  • Rewards are bounded.
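To make the Robbins-Monro conditions concrete, the sketch below compares partial sums for three hypothetical step-size schedules. The helper `partial_sums` and the schedule names are invented for this example, and finite partial sums can only suggest, not prove, whether a series diverges. A rate of 1/n meets both conditions: its sum keeps growing while the sum of squares levels off. A constant rate fails the second condition because its squares also sum without bound, and 1/n^2 fails the first because its plain sum levels off.

```python
def partial_sums(schedule, n_terms):
    """Return (sum of rates, sum of squared rates) over the first n_terms steps."""
    total, total_sq = 0.0, 0.0
    for n in range(1, n_terms + 1):
        rate = schedule(n)
        total += rate
        total_sq += rate * rate
    return total, total_sq


def harmonic(n):
    # Rate 1/n: satisfies both conditions.
    return 1.0 / n


def constant(n):
    # Fixed rate: the sum diverges, but so does the sum of squares.
    return 0.1


def inverse_square(n):
    # Rate 1/n^2: decays too fast, so the plain sum stays finite.
    return 1.0 / (n * n)


for name, schedule in [("1/n", harmonic), ("0.1", constant), ("1/n^2", inverse_square)]:
    for n_terms in (1_000, 100_000):
        total, total_sq = partial_sums(schedule, n_terms)
        print(f"{name:>6}  n={n_terms:>7}  sum={total:10.3f}  sum of squares={total_sq:10.3f}")
```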

The infinite visitation requirement is why exploration matters; a purely greedy agent might never sample the pairs it needs.
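A small experiment shows the effect. The setup is entirely hypothetical: a one-state MDP with two actions paying 0.1 and 1.0, a discount of 0.5, a step size of 1/(visit count) for each action, and greedy ties broken toward the lower action index. Under those assumptions the optimal action values are 1.1 and 2.0. A purely greedy agent locks onto action 0 after its first reward and never samples action 1, while an epsilon-greedy agent visits both.

```python
import random

REWARDS = [0.1, 1.0]  # hypothetical one-state MDP: action 1 pays more
GAMMA = 0.5           # assumed discount; optimal values are then [1.1, 2.0]


def run(epsilon, steps=20000, seed=0):
    """Run tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    visits = [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)                     # explore uniformly
        else:
            action = max(range(2), key=lambda a: q[a])    # greedy, ties go to action 0
        visits[action] += 1
        alpha = 1.0 / visits[action]                      # per-action rate 1/n
        target = REWARDS[action] + GAMMA * max(q)         # the only state is also the next state
        q[action] += alpha * (target - q[action])
    return q, visits


greedy_q, greedy_visits = run(epsilon=0.0)
explore_q, explore_visits = run(epsilon=0.2)
print("greedy     visits:", greedy_visits, "estimates:", greedy_q)
print("exploring  visits:", explore_visits, "estimates:", explore_q)
```

In this setup the greedy agent's estimate for action 0 settles near 0.2, the value of taking action 0 forever, which is far from the optimal 1.1. It never learns anything about action 1 because it never tries it.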

Key idea

Tabular Q-learning converges to the optimal action values when every state-action pair is visited infinitely often, the learning rates meet the Robbins-Monro conditions, and rewards are bounded. The visitation requirement is why exploration is essential.

Check yourself

1. Why is Q-learning called off-policy?

2. Which condition is required for tabular Q-learning to converge?

3. What do the Robbins-Monro conditions require of the learning rates?