Temporal Difference Learning
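Temporal difference (TD) learning blends ideas from Monte Carlo methods and dynamic programming. Like Monte Carlo, it learns directly from sampled experience without a model of the environment. Like dynamic programming, it updates estimates from other estimates instead of waiting for a final outcome, so it can update after every step.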
Bootstrapping
The key trick is bootstrapping: instead of waiting for the full return, TD updates a value toward the immediate reward plus the current estimate of the next state's value. It learns a guess from a guess.
- After one step it computes a TD target: reward plus discounted next value.
- The gap between the target and the old estimate is the TD error.
- The value moves a small step toward the target.
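The three steps above can be sketched as a single tabular update. This is a minimal illustration, not a reference implementation: the function name `td0_update`, the dictionary-based value table, and the default step size `alpha` and discount `gamma` are all assumptions made for the example.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update in place and return the TD error.

    `values` is a hypothetical dict mapping each state to its estimate.
    """
    # A terminal next state has no future reward, so its value is 0.
    next_value = 0.0 if terminal else values[next_state]
    # TD target: reward plus discounted next value.
    td_target = reward + gamma * next_value
    # TD error: gap between the target and the old estimate.
    td_error = td_target - values[state]
    # Move the estimate a small step (alpha) toward the target.
    values[state] += alpha * td_error
    return td_error


values = {"A": 0.0, "B": 1.0}
error = td0_update(values, "A", reward=0.5, next_state="B")
```

With these made-up numbers the target is 0.5 + 0.9 × 1.0 = 1.4, the TD error is 1.4, and the estimate for A moves from 0 to about 0.14.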
Online and incremental
Because TD updates at every step, it learns during an episode and also works on continuing tasks that never terminate. It needs only the current transition, not a stored episode, which makes it memory efficient and quick to react.
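To make the online behaviour concrete, here is a small sketch of TD(0) running on a task with no terminal state. The two-state loop, its rewards, and the function name `run_online_td` are invented for illustration; the only thing kept between steps is the value table and the current state.

```python
def run_online_td(num_steps=5000, alpha=0.1, gamma=0.9):
    """TD(0) on a hypothetical two-state loop that never terminates."""
    # Deterministic dynamics: A -> B gives reward 1, B -> A gives reward 0.
    transitions = {"A": ("B", 1.0), "B": ("A", 0.0)}
    values = {"A": 0.0, "B": 0.0}
    state = "A"
    for _ in range(num_steps):
        next_state, reward = transitions[state]
        # Update immediately from the single transition just observed.
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += alpha * td_error
        # Move on; no history of earlier transitions is stored.
        state = next_state
    return values


values = run_online_td()
```

For this loop the true values satisfy V(A) = 1 + 0.9 V(B) and V(B) = 0.9 V(A), so V(A) = 1 / 0.19 ≈ 5.26 and V(B) ≈ 4.74. The estimates approach these numbers even though no episode ever ends.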
The bias-variance trade-off
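TD has lower variance than Monte Carlo because each target depends on only one sampled reward and transition, whereas a Monte Carlo return sums many random rewards. The price is bias: the target leans on the next state's estimate, which is wrong early in learning. As the estimates improve the bias shrinks, and under standard conditions (for example, tabular values with a suitably decaying step size) the estimates converge.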
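The trade-off can be seen numerically in a toy setting. The sketch below assumes a made-up 10-step episode with undiscounted Gaussian rewards (mean 1, standard deviation 1) and a deliberately wrong next-state estimate of 0; the function name `sample_targets` and all of its parameters are hypothetical.

```python
import random
import statistics


def sample_targets(num_episodes=2000, horizon=10,
                   next_value_estimate=0.0, seed=0):
    """Collect Monte Carlo and TD targets for the start state."""
    rng = random.Random(seed)
    mc_targets, td_targets = [], []
    for _ in range(num_episodes):
        rewards = [rng.gauss(1.0, 1.0) for _ in range(horizon)]
        # Monte Carlo target: the full return (gamma = 1 here).
        mc_targets.append(sum(rewards))
        # TD target: one sampled reward plus the current estimate.
        td_targets.append(rewards[0] + next_value_estimate)
    return mc_targets, td_targets


mc_targets, td_targets = sample_targets()
mc_var = statistics.pvariance(mc_targets)
td_var = statistics.pvariance(td_targets)
mc_mean = statistics.mean(mc_targets)
td_mean = statistics.mean(td_targets)
```

In this setup the true value of the start state is 10. The Monte Carlo targets average close to 10 but with variance near 10, while the TD targets have variance near 1 but average close to 1, because the next-state estimate of 0 is far from the true 9. A better estimate would remove that bias without raising the variance.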
Why it dominates
Many widely used algorithms, including Q-learning and SARSA, are TD methods. Bootstrapping from current estimates makes learning sample efficient and well suited to ongoing control.
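Both algorithms apply the same TD update to action values and differ only in how the target is formed. The sketch below assumes a hypothetical nested-dict table `q[state][action]` and illustrative defaults for `alpha` and `gamma`; terminal-state handling is left out to keep it short.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """SARSA: bootstrap from the action actually taken next."""
    target = reward + gamma * q[next_state][next_action]
    q[state][action] += alpha * (target - q[state][action])


def q_learning_update(q, state, action, reward, next_state,
                      alpha=0.5, gamma=0.9):
    """Q-learning: bootstrap from the best-valued next action."""
    target = reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])


def make_q():
    # Made-up action values used only for this comparison.
    return {"s": {"left": 0.0, "right": 0.0},
            "s2": {"left": 1.0, "right": 2.0}}


q_sarsa = make_q()
sarsa_update(q_sarsa, "s", "right", reward=0.0,
             next_state="s2", next_action="left")

q_off = make_q()
q_learning_update(q_off, "s", "right", reward=0.0, next_state="s2")
```

From the same transition, SARSA uses the value of the next action it was told the agent took (`left`, worth 1.0), while Q-learning uses the larger value (2.0), so the two updated entries differ: about 0.45 versus 0.9.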
Key idea
Temporal difference learning bootstraps, updating a value each step toward the reward plus the estimated next value, giving fast online learning with lower variance than Monte Carlo.