The maximization bias
Standard Q-learning uses the same estimates both to select and to evaluate the best next action. Because the expected maximum of noisy estimates is at least as large as the maximum of the true values, this systematically overestimates action values, a problem known as maximization bias. The bias is most pronounced when many actions have similar true values, since noise alone then decides which action looks best.
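The effect can be reproduced with a small simulation. The sketch below is illustrative only: it assumes every action has a true value of 0 and that each estimate is the mean of a few unit-variance Gaussian samples. The function names and parameters are hypothetical, not from any library.

```python
import random
import statistics


def max_of_noisy_estimates(n_actions, n_samples, rng):
    """Estimate each action's value (true value 0) and return the largest estimate."""
    estimates = []
    for _ in range(n_actions):
        # Assumption: rewards are N(0, 1), so the true value of every action is 0.
        samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        estimates.append(statistics.fmean(samples))
    return max(estimates)


def average_bias(n_actions, n_samples=5, trials=2000, seed=0):
    """Average the max estimate over many trials; any value above 0 is pure bias."""
    rng = random.Random(seed)
    results = [max_of_noisy_estimates(n_actions, n_samples, rng) for _ in range(trials)]
    return statistics.fmean(results)


if __name__ == "__main__":
    for k in (1, 2, 10):
        print(f"{k} actions: average max estimate = {average_bias(k):.3f}")
```

With a single action the average stays near 0. As more equally valued actions are added, the average max estimate rises even though no action is actually better.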
Two estimators
Double Q-learning keeps two separately learned value tables. On each update it randomly picks one table to update, then:
- Uses the table being updated to select the greedy next action.
- Uses the other table to evaluate that selected action.
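Because the selecting estimator and the evaluating estimator are decoupled, noise that makes an action look best in one table is not, on average, repeated in the other, so the upward bias of the max is largely removed. The resulting estimate can instead be biased slightly downward. Within a single update, no table both chooses and grades the same action.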
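A minimal sketch of one tabular update, assuming the table chosen for updating selects the greedy next action and the other table evaluates it. The function name, the dictionary-based tables keyed by (state, action), and the `update_first` argument are assumptions made for illustration.

```python
import random
from collections import defaultdict


def double_q_update(q_a, q_b, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.99, done=False, update_first=None):
    """Apply one Double Q-learning update in place to q_a or q_b."""
    # Coin flip decides which table is updated (can be forced for testing).
    if update_first is None:
        update_first = random.random() < 0.5
    q_sel, q_eval = (q_a, q_b) if update_first else (q_b, q_a)

    if done:
        target = reward
    else:
        # Selection: greedy action according to the table being updated.
        best = max(actions, key=lambda a: q_sel[(next_state, a)])
        # Evaluation: value of that action according to the other table.
        target = reward + gamma * q_eval[(next_state, best)]

    q_sel[(state, action)] += alpha * (target - q_sel[(state, action)])


if __name__ == "__main__":
    q_a, q_b = defaultdict(float), defaultdict(float)
    double_q_update(q_a, q_b, "s0", 0, 1.0, "s1", actions=[0, 1])
    print(dict(q_a), dict(q_b))
```

When acting, a common choice is to behave greedily (or epsilon-greedily) with respect to the sum or average of the two tables, so both contribute to the policy.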
In deep RL
The same idea powers Double DQN, which uses the online network to select the next action and the existing target network to evaluate it, so it adds almost no extra cost over DQN. This reduces overestimation and stabilizes learning on Atari and similar benchmarks.
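The difference from the standard DQN target comes down to one line. The sketch below uses plain Python lists in place of network outputs: `q_online_next` and `q_target_next` are assumed to hold the online and target networks' Q-values for the next state, and both function names are hypothetical.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    # Selection: index of the greedy action under the online network.
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Evaluation: the target network's value for that action.
    return reward + gamma * q_target_next[best]


if __name__ == "__main__":
    q_online_next = [1.0, 3.0, 2.0]
    q_target_next = [4.0, 0.5, 1.0]
    print(dqn_target(1.0, False, q_target_next, gamma=0.5))
    print(double_dqn_target(1.0, False, q_online_next, q_target_next, gamma=0.5))
```

In a real implementation the two lists would be batched tensor outputs and the index lookup a gather operation, but the select-then-evaluate structure is the same.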
Key idea
Double Q-learning splits selection and evaluation across two estimators so that the maximization bias of taking a max over noisy values is largely removed, yielding less overoptimistic action values.