The Bellman Equation

The Bellman equation is the heart of reinforcement learning. It expresses the value of a state in terms of the immediate reward plus the discounted value of where you land next.

The recursive idea

Instead of summing an infinite future directly, the Bellman equation says:

The value of a state equals the expected immediate reward plus the discounted value of the next state.
This holds for every state, giving a system of equations that the true value function satisfies.
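Written out in symbols, the statement above might look like the following. The notation is an assumption, since the text fixes none: V^pi is the value function of a policy pi, P(s' | s, a) is the transition probability, R(s, a, s') is the reward, and gamma is the discount factor.

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
\left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]
\qquad \text{for every state } s
```

With one such equation per state, a finite problem gives as many linear equations as unknowns.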

By unrolling one step at a time, the long-term return collapses into a relationship between a state and its successors. This recursion is why values can be computed without simulating forever.
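As a rough illustration of computing values without simulating forever, the sketch below repeatedly applies the one-step relationship until the values stop changing. The two-state chain, its rewards, and the name evaluate_policy are all made up for this example; the policy is assumed to be already folded into the transition matrix P and reward vector R.

```python
def evaluate_policy(P, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Repeat V(s) <- R(s) + gamma * sum_t P[s][t] * V(t) until the change is tiny."""
    n = len(R)
    V = [0.0] * n
    for _ in range(max_sweeps):
        new_V = [
            R[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
            for s in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    return V


# Hypothetical chain: state 0 pays 1 and moves to state 1,
# state 1 pays 2 and moves back to state 0.
P = [[0.0, 1.0],
     [1.0, 0.0]]
R = [1.0, 2.0]
V = evaluate_policy(P, R, gamma=0.9)
```

Solving the two equations by hand gives V[0] = 2.8 / 0.19 and V[1] = 2.9 / 0.19, which is what the sweeps converge to. No trajectory was ever rolled out to infinity.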

Expectation and discount

The next state may be random, so the equation takes an expectation over transitions. The discount factor, between zero and one, shrinks distant rewards. When it is strictly less than one and rewards are bounded, the infinite sum stays finite. It also encodes a preference for sooner payoffs.
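Both ingredients can be sketched in a few lines. The helper names discounted_return and expected_backup are hypothetical, as are the toy numbers; the first folds a reward sequence from the back using the same one-step recursion, and the second averages over possible next states.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], computed back to front."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def expected_backup(reward, transitions, values, gamma):
    """Immediate reward plus discounted expected value of the next state.

    transitions maps next_state -> probability (assumed to sum to 1).
    """
    expected_next = sum(p * values[s] for s, p in transitions.items())
    return reward + gamma * expected_next


short = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25
long_run = discounted_return([1.0] * 1000, gamma=0.9)   # approaches 1 / (1 - 0.9)
backup = expected_backup(1.0, {"a": 0.5, "b": 0.5}, {"a": 2.0, "b": 4.0}, gamma=0.5)
```

A constant reward of 1 with discount 0.9 approaches 10 however long the sequence gets, which is the finiteness the discount buys.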

Two flavors

There is a Bellman equation for the value of a fixed policy, and the Bellman optimality equation, which takes the best action at each step. With a discount factor below one and bounded rewards, the optimal value function is the unique solution to the optimality form.
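One way to see the optimality form in action is value iteration, sketched here under stated assumptions: the two-state MDP, its action names, and the function value_iteration are invented for illustration. The only change from the fixed-policy case is the max over actions.

```python
def value_iteration(mdp, gamma, tol=1e-10, max_sweeps=10_000):
    """mdp[s][a] is a list of (probability, next_state, reward) triples."""
    V = {s: 0.0 for s in mdp}
    for _ in range(max_sweeps):
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            for s, actions in mdp.items()
        }
        delta = max(abs(new_V[s] - V[s]) for s in mdp)
        V = new_V
        if delta < tol:
            break
    return V


# Hypothetical MDP: staying in "right" pays 2 per step,
# moving from "left" to "right" pays 1, everything else pays 0.
mdp = {
    "left": {"stay": [(1.0, "left", 0.0)], "go": [(1.0, "right", 1.0)]},
    "right": {"stay": [(1.0, "right", 2.0)], "go": [(1.0, "left", 0.0)]},
}
V_star = value_iteration(mdp, gamma=0.5)
```

Here the best plan is to reach "right" and stay, so the optimal values work out to 4 for "right" (2 + 0.5 * 4) and 3 for "left" (1 + 0.5 * 4).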

Why it matters

Almost every RL algorithm is a way of solving or approximating the Bellman equation, whether by exact iteration or by sampling. It turns a daunting infinite horizon into local updates.
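On the sampling side, a minimal sketch of a TD(0)-style update is given below. It never reads a transition model; it only sees (state, reward, next state) samples. The simulator step, the step size, and the two-state cycle are assumptions made for this example, and the simulator is deliberately deterministic so the result is easy to check by hand.

```python
def step(state):
    """Hypothetical simulator: returns (reward, next_state) for a two-state cycle."""
    return (1.0, 1) if state == 0 else (2.0, 0)


def td0(num_steps, gamma, alpha):
    """Nudge V(s) toward the sampled target r + gamma * V(s') after every step."""
    V = [0.0, 0.0]
    state = 0
    for _ in range(num_steps):
        reward, next_state = step(state)
        target = reward + gamma * V[next_state]
        V[state] += alpha * (target - V[state])
        state = next_state
    return V


V_td = td0(num_steps=5000, gamma=0.5, alpha=0.1)
```

For this cycle the Bellman equation gives V[0] = 8/3 and V[1] = 10/3, and the local updates settle there. With random transitions and a constant step size, the estimates would instead fluctuate around the solution.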

Key idea

The Bellman equation writes a state's value as immediate reward plus the discounted value of its successor, turning long horizons into local updates.
