The Q-Learning Convergence Conditions

When the classic off-policy control algorithm provably finds optimal action values.

Off-policy control

Q-learning estimates the optimal action-value function directly. Its update target is the reward plus the discounted maximum action value in the next state, regardless of which action the agent actually takes next. That max is why Q-learning is off-policy: it learns about the greedy policy while following a different, more exploratory behavior policy.

The update

Each step nudges the current action value toward the target:

Observe state, action, reward, and next state.
Build the target as the reward plus gamma (the discount factor) times the best action value in the next state.
Move the estimate toward that target by a fraction equal to the learning rate.
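
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name q_update, the dictionary-based table keyed by (state, action), the done flag for terminal transitions, and the example states and actions are all assumptions made for the sketch.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new estimate.

    Q is a mapping from (state, action) to a value; `actions` lists the
    actions available in next_state (assumed the same everywhere here).
    """
    # Best next action value; a terminal next state contributes nothing.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    # Target: reward plus gamma times the best next action value.
    target = reward + gamma * best_next
    # Move the estimate a learning-rate-sized step toward the target.
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


# Hypothetical two-action example with values chosen to be exact in binary.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
new_value = q_update(Q, "s0", "right", reward=1.0, next_state="s1",
                     actions=["left", "right"], alpha=0.5, gamma=0.5)
print(new_value)  # 1.0: target is 1 + 0.5 * 2 = 2, and half the gap from 0 is 1
```

Note that the target uses the maximum over next actions, not the action the agent goes on to take, which is the off-policy part of the update.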

Convergence requirements

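In a finite Markov decision process with a discount factor below one, tabular Q-learning converges to the optimal action values with probability one under the classic conditions:

Every state-action pair is visited, and therefore updated, infinitely often.
The learning rates satisfy the Robbins-Monro conditions: for each state-action pair, the step sizes sum to infinity while their squares sum to a finite value.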
Rewards are bounded.
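
One schedule that meets the learning rate condition is 1/n, where n counts the visits to a given state-action pair. The sketch below is illustrative only: the helper learning_rate and the counter visit_counts are hypothetical names, and the partial sums merely suggest the behavior of the infinite series (the sum of 1/n keeps growing, while the sum of 1/n^2 stays below pi^2/6).

```python
import math
from collections import defaultdict

# Hypothetical per-pair visit counter; not part of any library.
visit_counts = defaultdict(int)


def learning_rate(state, action):
    """Return 1 / n for the n-th visit to (state, action)."""
    visit_counts[(state, action)] += 1
    return 1.0 / visit_counts[(state, action)]


# Numerical illustration of the two series for the 1/n schedule.
n_terms = 100_000
sum_rates = sum(1.0 / n for n in range(1, n_terms + 1))
sum_squares = sum(1.0 / n**2 for n in range(1, n_terms + 1))

print(f"partial sum of rates:   {sum_rates:.3f}")
print(f"partial sum of squares: {sum_squares:.5f} (limit pi^2/6 = {math.pi**2 / 6:.5f})")
```

A constant learning rate, by contrast, fails the second condition because its squares sum to infinity, so the guarantee stated here does not cover it.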

The infinite visitation requirement is why exploration matters; a purely greedy agent might never sample the pairs it needs.
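
Epsilon-greedy action selection is one common way to keep exploring. The sketch below assumes a dictionary-based table keyed by (state, action); the function name epsilon_greedy, its parameters, and the example states are illustrative choices, not taken from any particular library.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else a greedy one.

    Q maps (state, action) to a value; unseen pairs default to 0.0.
    `rng` can be the random module or a seeded random.Random instance.
    """
    if rng.random() < epsilon:
        # Explore: every action has a nonzero chance of being tried.
        return rng.choice(actions)
    # Exploit: choose an action with the highest current estimate.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


# Hypothetical usage with a seeded generator for reproducibility.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
rng = random.Random(0)
action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1, rng=rng)
```

With a fixed epsilon above zero, every action keeps a nonzero selection probability in each state the agent reaches. Whether every state is actually reached infinitely often still depends on the environment.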

Key idea

Tabular Q-learning converges to the optimal action values when every state-action pair is visited infinitely often and the learning rates meet the Robbins-Monro conditions. The visitation requirement is why exploration is essential.
