The Policy and Value Function

Two central objects in reinforcement learning are the policy, which says what to do, and the value function, which says how good a situation is.

The policy

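A policy maps states to actions. It can be deterministic, always choosing the same action in a given state, or stochastic, giving a probability distribution over actions in each state. The goal of learning is to find a policy that earns high expected return, where return is the cumulative (often discounted) reward the agent collects over time.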
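
As a minimal sketch of the two kinds of policy, assuming a made-up problem with two states and two actions (the state names, action names, and probabilities below are hypothetical and chosen only for illustration):

```python
import random

# Hypothetical two-state, two-action problem; all names are illustrative.
STATES = ["cold", "hot"]
ACTIONS = ["wait", "move"]

# Deterministic policy: exactly one action per state.
deterministic_policy = {"cold": "move", "hot": "wait"}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "cold": {"wait": 0.2, "move": 0.8},
    "hot": {"wait": 0.9, "move": 0.1},
}


def act_deterministic(state):
    """Return the single action the deterministic policy assigns to `state`."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action according to the stochastic policy's distribution."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

A deterministic policy can be seen as the special case of a stochastic one that puts probability 1 on a single action in every state.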

The value function

A value function gives the expected future return under a given policy. In practice the agent usually works with a learned estimate of it.

The state value is the expected return starting from a state and following the policy.
The action value, often called Q, is the expected return after taking a specific action in a state and then following the policy.
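
The two definitions above can be written compactly as follows. The notation is an assumption, since the text does not fix one: \pi is the policy, S_t, A_t, and R_t are the state, action, and reward at step t, G_t is the return, and \gamma in [0, 1] is a discount factor.

```latex
\begin{align*}
G_t &= \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \\
V^{\pi}(s) &= \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] \\
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\left[ G_t \mid S_t = s,\ A_t = a \right] \\
V^{\pi}(s) &= \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
\end{align*}
```

The last line links the two: the state value is the action value averaged over the actions the policy would choose in that state.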

Values let the agent compare situations and choices without simulating the whole future every time.

How they connect

Given a value function, the agent can improve its policy by preferring high-value actions. Given a policy, the agent can estimate its values. This back and forth between evaluating and improving is the engine behind many RL methods.
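
The evaluate-then-improve loop can be sketched on a tiny problem. This is an illustration under stated assumptions rather than a reference implementation: the three-state deterministic environment, the action names "stay" and "go", and the discount factor of 0.9 are all hypothetical, and the transition model is assumed to be known to the agent.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic environment:
# TRANSITIONS[state][action] = (next_state, reward).
TRANSITIONS = {
    0: {"stay": (0, 0.0), "go": (1, 0.0)},
    1: {"stay": (1, 0.0), "go": (2, 1.0)},
    2: {"stay": (2, 0.0), "go": (2, 0.0)},  # absorbing state, no further reward
}


def action_value(values, state, action):
    """One-step lookahead: reward plus discounted value of the next state."""
    next_state, reward = TRANSITIONS[state][action]
    return reward + GAMMA * values[next_state]


def evaluate(policy, tol=1e-10):
    """Estimate the state values of a fixed deterministic policy by sweeping."""
    values = {s: 0.0 for s in TRANSITIONS}
    while True:
        delta = 0.0
        for s in TRANSITIONS:
            new_value = action_value(values, s, policy[s])
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            return values


def improve(values):
    """Build the policy that is greedy with respect to the given values."""
    return {
        s: max(TRANSITIONS[s], key=lambda a: action_value(values, s, a))
        for s in TRANSITIONS
    }


def policy_iteration():
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: "stay" for s in TRANSITIONS}
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
```

In this toy environment the loop settles on choosing "go" in states 0 and 1, which is the only way to reach the single rewarding transition.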

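A policy that is greedy with respect to the optimal action values is optimal, since it always picks the choice with the highest achievable expected return. Being greedy with respect to accurate action values of some other policy guarantees only a policy that is at least as good as that one, which is why evaluation and improvement are repeated.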
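
The role of greediness can be stated more precisely. In the assumed notation below, Q^{\pi} and V^{\pi} are the action and state values of a policy \pi, and Q^{*} denotes the optimal action values, the best achievable over all policies.

```latex
\begin{align*}
\pi'(s) \in \arg\max_{a} Q^{\pi}(s, a) \ \text{for all } s
  &\;\Longrightarrow\; V^{\pi'}(s) \ge V^{\pi}(s) \ \text{for all } s \\
Q^{*}(s, a) &= \max_{\pi} Q^{\pi}(s, a) \\
\pi^{*}(s) &\in \arg\max_{a} Q^{*}(s, a)
\end{align*}
```

The first line is the policy improvement result: acting greedily on the values of \pi never makes things worse. The last line says that acting greedily on the optimal action values yields an optimal policy.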

Key idea

The policy decides actions while the value function scores expected return, and the two refine each other toward optimal behavior.
