The Temporal Difference Learning Deep Dive

Learning value estimates from raw experience by bootstrapping off later estimates.

Learning without a model

Temporal difference (TD) learning estimates value functions from sampled experience, with no model of transitions or rewards. After each step it nudges the current estimate toward a target built from the observed reward and the next state's estimate.

The TD error

The heart of TD is the TD error: the reward plus the discounted estimate of the next state, minus the estimate of the current state. You move the current value a small step, scaled by a learning rate, in the direction of this error.

If the target exceeds the estimate, the value rises.
If it falls short, the value drops.
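
As a minimal sketch of the update described above, assuming a tabular value function stored in a plain dictionary: the function name `td0_update`, the parameter names `alpha` (learning rate) and `gamma` (discount factor), and the state labels are illustrative choices, not taken from any particular library.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular TD update in place and return the TD error.

    values: dict mapping state -> current value estimate (hypothetical layout).
    """
    # A terminal next state contributes no future value.
    next_value = 0.0 if terminal else values[next_state]
    # TD error: reward plus discounted next estimate, minus current estimate.
    td_error = reward + gamma * next_value - values[state]
    # Move the current estimate a small step in the direction of the error.
    values[state] += alpha * td_error
    return td_error


# Target (0.5 + 0.9 * 1.0) exceeds the estimate of A (0.0), so V(A) rises.
values_up = {"A": 0.0, "B": 1.0}
delta_up = td0_update(values_up, "A", reward=0.5, next_state="B")

# Target (0.0 + 0.9 * 1.0) falls short of the estimate of A (2.0), so V(A) drops.
values_down = {"A": 2.0, "B": 1.0}
delta_down = td0_update(values_down, "A", reward=0.0, next_state="B")
```

The sign of the returned error matches the two cases listed above: positive when the target exceeds the estimate, negative when it falls short.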

Bootstrapping

TD bootstraps, meaning its target relies on another learned estimate rather than waiting for a full return. This contrasts with Monte Carlo, which waits until an episode ends to use the actual return. Bootstrapping lets TD learn online, step by step, and often with lower variance, at the cost of some bias from imperfect estimates.
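
To make the contrast concrete, here is a rough sketch that computes both kinds of target for one short episode. The reward sequence, the successor-state estimates in `next_values`, and the helper names are made up for illustration.

```python
def monte_carlo_targets(rewards, gamma):
    """Actual discounted return from each step; requires the finished episode."""
    targets = []
    g = 0.0
    # Walk backward so each return reuses the one that follows it.
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    targets.reverse()
    return targets


def td_targets(rewards, next_values, gamma):
    """One-step bootstrapped target for each step; usable as soon as the step ends."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


gamma = 0.9
rewards = [0.0, 0.0, 1.0]        # hypothetical three-step episode
next_values = [0.2, 0.5, 0.0]    # assumed current estimates of each successor state (0.0 at termination)

mc = monte_carlo_targets(rewards, gamma)
td = td_targets(rewards, next_values, gamma)
```

In this sketch the Monte Carlo targets depend only on observed rewards, while the first two TD targets depend on `next_values`. If those estimates are off, the TD targets inherit the error, which is the bias mentioned above.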

Key idea

TD learning updates value estimates online using a bootstrapped target: the reward plus the discounted next estimate. It trades a little bias for lower variance and model-free, step-by-step learning.
