Learning without a model
Temporal difference (TD) learning estimates value functions from sampled experience, with no model of transitions or rewards. After each step it nudges the current estimate toward a target built from the observed reward and the next state's estimate.
The TD error
The heart of TD is the TD error: the reward plus the discounted estimate of the next state, minus the estimate of the current state. You move the current value a small step, scaled by a learning rate, in the direction of this error.
- If the target exceeds the estimate, the value rises.
- If it falls short, the value drops.
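The update described above can be sketched in a few lines of Python. This is a minimal tabular illustration, not a definitive implementation: the function name td0_update, the dictionary-based value table, the terminal flag, and the default alpha (learning rate) and gamma (discount) values are all assumptions made for the example.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD update to `values` in place and return the TD error.

    All names and defaults here are illustrative assumptions.
    """
    # A terminal next state contributes no future value.
    next_value = 0.0 if terminal else values[next_state]
    # Target: observed reward plus discounted estimate of the next state.
    target = reward + gamma * next_value
    # TD error: target minus the current estimate.
    td_error = target - values[state]
    # Move the estimate a small step in the direction of the error.
    values[state] += alpha * td_error
    return td_error


# Hypothetical two-state value table.
values = {"A": 0.0, "B": 1.0}
delta = td0_update(values, "A", reward=0.5, next_state="B")
```

Here the target (0.5 + 0.9 * 1.0) exceeds the estimate for A, so the TD error is positive and the value of A rises by the learning rate times that error.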
Bootstrapping
TD bootstraps, meaning its target relies on another learned estimate rather than waiting for a full return. This contrasts with Monte Carlo, which waits until an episode ends to use the actual return. Bootstrapping lets TD learn online, step by step, and often with lower variance, at the cost of some bias from imperfect estimates.
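To make the contrast concrete, the sketch below computes both kinds of target for one short episode. The reward sequence, the next-state estimates, the discount of 0.9, and the function names are hypothetical values chosen for illustration only.

```python
def monte_carlo_targets(rewards, gamma=0.9):
    """Discounted return from each step; needs the whole episode first."""
    targets, g = [], 0.0
    # Work backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return targets[::-1]


def td_targets(rewards, next_values, gamma=0.9):
    """One-step bootstrapped targets; each is available right after its step."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Hypothetical three-step episode with a reward only at the end.
rewards = [0.0, 0.0, 1.0]
# Assumed current estimates of each successor state (0.0 for the terminal state).
next_values = [0.2, 0.5, 0.0]

mc = monte_carlo_targets(rewards)
td = td_targets(rewards, next_values)
```

The Monte Carlo targets depend only on the rewards actually observed, while the TD targets for the first two steps depend on the current estimates, which is where the bias comes from when those estimates are imperfect.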
Key idea
TD learning updates value estimates online using a bootstrapped target of reward plus the discounted next estimate, trading a little bias for lower variance and model-free, step-by-step learning.