Value functions
The value of a state under a policy is the expected discounted return obtained by starting in that state and following the policy from then on. The optimal value of a state is the highest value achievable from it over all policies. Once the optimal values are known, acting greedily with respect to them, that is, choosing in each state the action with the best one-step lookahead, is optimal.
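The definitions above can be written compactly in symbols. The notation here is an assumption introduced for illustration, not taken from the text: \pi is a policy, \gamma the discount factor, r_{t+1} the reward received after step t, R(s, a) the expected immediate reward, and P(s' | s, a) the transition probability.

```latex
% Value of state s under policy \pi: expected discounted return.
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s \right]

% Optimal value: the best value achievable over all policies.
V^{*}(s) = \max_{\pi} V^{\pi}(s)

% Acting greedily with respect to V^{*} (one-step lookahead).
\pi^{*}(s) \in \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]
```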
The recursion
The Bellman optimality equation says the optimal value of a state equals the value of its best action. The value of an action is the expected immediate reward plus the discounted optimal value of the state you land in, averaged over the possible next states:
- Consider every action available in the state.
- For each, take the expected reward plus gamma times the expected optimal value of the next state.
- The optimal value of the state is the maximum over those actions.
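The three steps above correspond to the following equation. The symbols are assumed notation, not the text's own: A(s) is the set of actions available in state s, R(s, a) the expected immediate reward, P(s' | s, a) the probability of landing in s', \gamma the discount factor, and Q^{*} the optimal action value.

```latex
% Step 2: value of taking action a in state s, then behaving optimally.
Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s')

% Steps 1 and 3: maximize over every action available in s.
V^{*}(s) = \max_{a \in A(s)} Q^{*}(s, a)
```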
Why it matters
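This equation is a fixed-point condition: applying its right-hand side to the optimal value function returns the same function. With a discount factor below one and bounded rewards, the optimal value function is the unique solution, and many planning and learning algorithms, such as value iteration and Q-learning, are different ways of solving it. The max over actions makes the equation nonlinear, which distinguishes it from the Bellman expectation equation for a fixed policy, a linear system.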
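One way to see the fixed-point view is value iteration, which applies the optimality backup repeatedly until the values stop changing. The sketch below is a minimal illustration, not a reference implementation: the two-state MDP, the names P, R and GAMMA, and the helper functions are all hypothetical and chosen only to keep the example small.

```python
# Hypothetical 2-state, 2-action MDP, invented purely for illustration.
# P[s][a] lists (probability, next_state) pairs; R[s][a] is the expected reward.
GAMMA = 0.9  # assumed discount factor, below one so the iteration converges

P = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}
R = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 2.0, "go": 0.0},
}


def action_value(V, s, a):
    """Expected reward plus gamma times the expected value of the next state."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def bellman_backup(V):
    """Apply the Bellman optimality backup once: max over actions in every state."""
    return {s: max(action_value(V, s, a) for a in P[s]) for s in P}


def value_iteration(tol=1e-10, max_iters=10_000):
    """Repeat the backup until the largest change across states falls below tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        V_new = bellman_backup(V)
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new
    return V


def greedy_policy(V):
    """Pick, in each state, the action with the best one-step lookahead."""
    return {s: max(P[s], key=lambda a: action_value(V, s, a)) for s in P}


V_star = value_iteration()
policy = greedy_policy(V_star)
print(V_star, policy)
```

For this toy MDP the values settle at roughly 18.78 for state 0 and 20 for state 1, with the greedy policy choosing "go" in state 0 and "stay" in state 1. Applying bellman_backup to V_star leaves it unchanged up to the tolerance, which is the fixed-point property in action.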
Key idea
The Bellman optimality equation expresses each optimal state value as the maximum over actions of the expected immediate reward plus the discounted expected optimal value of the successor state. The optimal value function is the fixed point of this recursion, and acting greedily with respect to it is optimal.