Two networks, two jobs
Advantage actor critic (A2C) combines policy gradients with learned value estimates. The actor is the policy that picks actions. The critic is a value function that judges them. The critic supplies a low variance signal so the actor need not rely on noisy full returns.
The advantage signal
The actor's update uses the advantage, how much better an action was than the critic expected. A common estimate is the TD error: reward plus the discounted next state value minus the current state value. Positive advantage means the action beat expectations, so its probability rises.
- The critic learns by minimizing TD error, like value based learning.
- The actor learns by scaling its log probability gradient by the advantage.
Bias variance tradeoff
Using a bootstrapped advantage instead of a full return trades a little bias for much lower variance, the same tradeoff as TD versus Monte Carlo. This makes actor critic methods more sample efficient and stable than plain REINFORCE, and they form the backbone of modern algorithms like PPO.
Key idea
Actor critic pairs a policy actor with a value critic; the critic provides an advantage signal that replaces noisy returns, lowering variance and making policy gradient learning stable and efficient.