The Advantage Actor Critic Method

Pairing a learned policy with a learned value critic to cut policy gradient variance.

Two networks, two jobs

Advantage actor critic (A2C) combines policy gradients with learned value estimates. The actor is the policy that picks actions. The critic is a value function that judges them. The critic supplies a low variance signal so the actor need not rely on noisy full returns.

The advantage signal

The actor's update uses the advantage, how much better an action was than the critic expected. A common estimate is the TD error: reward plus the discounted next state value minus the current state value. Positive advantage means the action beat expectations, so its probability rises.

The critic learns by minimizing TD error, like value based learning.
The actor learns by scaling its log probability gradient by the advantage.

Bias variance tradeoff

Using a bootstrapped advantage instead of a full return trades a little bias for much lower variance, the same tradeoff as TD versus Monte Carlo. This makes actor critic methods more sample efficient and stable than plain REINFORCE, and they form the backbone of modern algorithms like PPO.

Key idea

Actor critic pairs a policy actor with a value critic; the critic provides an advantage signal that replaces noisy returns, lowering variance and making policy gradient learning stable and efficient.

The Advantage Actor Critic Method

Two networks, two jobs

The advantage signal

Bias variance tradeoff

Key idea

Check yourself