Machine Learning

The Temporal Difference Learning Deep Dive

Learning value estimates from raw experience by bootstrapping off later estimates.

6 min read · core

Learning without a model

Temporal difference (TD) learning estimates value functions from sampled experience, with no model of transitions or rewards. After each step it nudges the current estimate toward a target built from the observed reward and the next state's estimate.

The TD error

The heart of TD is the TD error: the reward plus the discounted estimate of the next state, minus the estimate of the current state. You move the current value a small step, scaled by a learning rate, in the direction of this error.

  • If the target exceeds the estimate, the value rises.
  • If it falls short, the value drops.
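The update described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name td0_update, the dictionary of state values, and the defaults alpha=0.1 (learning rate) and gamma=0.9 (discount) are all assumptions chosen for the example.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular TD update in place and return the TD error.

    `values` maps each state to its current value estimate.
    All names and defaults here are illustrative assumptions.
    """
    # A terminal next state has no future reward, so its value counts as 0.
    next_value = 0.0 if terminal else values[next_state]
    # Target: observed reward plus discounted estimate of the next state.
    target = reward + gamma * next_value
    # TD error: target minus the current estimate.
    td_error = target - values[state]
    # Move the current estimate a small step in the direction of the error.
    values[state] += alpha * td_error
    return td_error


values = {"A": 0.0, "B": 1.0}
delta = td0_update(values, "A", reward=0.5, next_state="B")
print(delta, values["A"])
```

With these made-up numbers the target is 0.5 + 0.9 × 1.0 = 1.4, so the TD error is 1.4 (the target exceeds the estimate) and the value of A rises from 0 to about 0.14.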

Bootstrapping

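TD bootstraps, meaning its target relies on another learned estimate rather than waiting for a full return. This contrasts with Monte Carlo, which waits until an episode ends and then uses the actual return as its target. Bootstrapping lets TD learn online, step by step, and often with lower variance, because the target depends on a single sampled reward and transition rather than a whole trajectory. The cost is some bias: while the next state's estimate is still inaccurate, so is the target built from it.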
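To make the contrast concrete, the sketch below computes both kinds of target for one recorded episode. The helper names mc_targets and td_targets, the three-step reward sequence, the value estimates, and gamma=0.5 are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mc_targets(rewards, gamma):
    """Monte Carlo targets: the full discounted return from each step.

    These can only be computed once the episode has ended.
    """
    targets = []
    g = 0.0
    # Work backwards so each return reuses the one after it.
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    targets.reverse()
    return targets


def td_targets(rewards, next_values, gamma):
    """TD targets: one reward plus the discounted estimate of the next state.

    Each target is available as soon as its step has been taken.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]


gamma = 0.5
rewards = [0.0, 0.0, 1.0]        # reward observed at each step
next_values = [0.5, 0.5, 0.0]    # current estimates of each next state; the last is terminal

print(mc_targets(rewards, gamma))               # [0.25, 0.5, 1.0]
print(td_targets(rewards, next_values, gamma))  # [0.25, 0.25, 1.0]
```

The two agree on the final step, where nothing is left to estimate. At the middle step the TD target (0.25) differs from the actual return (0.5) because it leans on a value estimate that is not yet accurate, which is the bias that bootstrapping accepts.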

Key idea

TD learning updates value estimates online using a bootstrapped target of reward plus the discounted next estimate, trading a little bias for lower variance and model-free, step-by-step learning.

Check yourself

1. What is the TD error?

2. How does TD differ from Monte Carlo?

3. What does bootstrapping trade off?