Machine Learning

The Actor-Critic Architecture

Combining a policy and a value estimate for lower-variance learning.

5 min read · advanced

Actor-critic methods combine the strengths of policy gradients and value-based learning. An actor chooses actions while a critic evaluates them, and the two are trained together.

Two networks, one loop

  • The actor is the policy. It outputs actions and is updated by a policy gradient.
  • The critic estimates a value function. It tells the actor how good its actions were.

The critic's estimate replaces the noisy full-episode (Monte Carlo) return used in plain policy gradients. This lowers the variance of the gradient estimate while keeping the update informative.
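The two roles can be sketched in a few lines. This is a minimal tabular illustration, not a definitive implementation: the sizes `N_STATES` and `N_ACTIONS` and the helper names `policy`, `act`, and `evaluate` are assumptions made for the example, and a real system would normally use neural networks in place of the tables.

```python
import math
import random

N_STATES, N_ACTIONS = 3, 2  # hypothetical sizes, chosen only for illustration

# Actor: one row of action preferences (logits) per state.
actor_logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
# Critic: one value estimate V(s) per state.
critic_values = [0.0] * N_STATES


def policy(state):
    """Softmax over the actor's logits for this state."""
    logits = actor_logits[state]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def act(state, rng):
    """Sample an action from the actor's policy."""
    probs = policy(state)
    return rng.choices(range(N_ACTIONS), weights=probs, k=1)[0]


def evaluate(state):
    """The critic's current estimate of how good this state is."""
    return critic_values[state]


rng = random.Random(0)
probs = policy(0)       # uniform at initialization, since all logits are zero
action = act(0, rng)    # the actor picks an action
value = evaluate(0)     # the critic scores the state
```

Keeping the actor and critic as separate tables mirrors the two-network picture; many practical implementations instead share part of one network between the two.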

The advantage signal

The critic typically supplies an advantage: the difference between the value of the chosen action and the baseline value of the state, A(s, a) = Q(s, a) - V(s). A positive advantage pushes the actor to make that action more likely; a negative one makes it less likely. This is the baseline idea from policy gradients, with the baseline learned online by the critic.
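One common way to estimate the advantage is the one-step TD error, r + gamma * V(s') - V(s). The sketch below assumes that estimator and a softmax policy; the function names, the learning rate, and the numeric values are all hypothetical and chosen only to show the sign of the update.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def softmax_logit_gradient(probs, action):
    """Gradient of log pi(action | s) with respect to the softmax logits."""
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]


probs = [0.5, 0.5]                      # current policy in some state
grad = softmax_logit_gradient(probs, action=1)

# An outcome better than the critic expected, and one worse than expected.
good = td_advantage(reward=1.0, value_s=0.2, value_next=0.5, gamma=0.9)
bad = td_advantage(reward=0.0, value_s=0.6, value_next=0.1, gamma=0.9)

lr = 0.1                                # hypothetical actor learning rate
logits_after_good = [0.0 + lr * good * g for g in grad]
logits_after_bad = [0.0 + lr * bad * g for g in grad]
```

With the positive advantage, the logit of action 1 rises relative to action 0; with the negative advantage, it falls.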

Why it works well

  • The critic enables bootstrapped updates, so the actor can learn step by step rather than waiting for full episodes.
  • Gradient variance drops compared to REINFORCE, which usually speeds up learning.
  • It scales to continuous control and large networks.
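The step-by-step learning described above can be sketched as a one-step tabular actor-critic loop. Everything about the environment is an assumption made for illustration: a four-state corridor where action 1 moves right, action 0 moves left, and reaching the last state pays a reward of 1. The hyperparameters are likewise arbitrary.

```python
import math
import random

N_STATES, GOAL, N_ACTIONS = 4, 3, 2   # hypothetical corridor: states 0..3, goal at 3
GAMMA, ACTOR_LR, CRITIC_LR = 0.9, 0.1, 0.1


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Action 1 moves right; action 0 moves left (bounded at state 0)."""
    next_state = state + 1 if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=2000, max_steps=50, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # actor
    values = [0.0] * N_STATES                               # critic
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            probs = softmax(logits[state])
            action = rng.choices(range(N_ACTIONS), weights=probs, k=1)[0]
            next_state, reward, done = step(state, action)

            # Bootstrapped target: uses V(s') instead of waiting for the full return.
            target = reward + (0.0 if done else GAMMA * values[next_state])
            td_error = target - values[state]  # one-step advantage estimate

            # Critic update: move V(s) toward the bootstrapped target.
            values[state] += CRITIC_LR * td_error
            # Actor update: policy gradient step weighted by the advantage estimate.
            for a in range(N_ACTIONS):
                grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
                logits[state][a] += ACTOR_LR * td_error * grad_log_pi

            if done:
                break
            state = next_state
    return logits, values


logits, values = train()
# Probability of moving right in each non-goal state after training.
right_probs = [softmax(logits[s])[1] for s in range(GOAL)]
```

Both tables are updated after every single transition, which is the practical difference from REINFORCE, where the update has to wait until the episode's return is known.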

Caveats

The critic introduces bias because its estimates are imperfect, so actor-critic methods trade some bias for much lower variance. Methods like A2C and A3C, and later PPO, build on this foundation.
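A small worked example may make the bias concrete. It assumes a single deterministic three-step episode and a critic whose guess for the next state is deliberately wrong; the rewards, discount, and guess are hypothetical numbers.

```python
def monte_carlo_return(rewards, gamma):
    """Discounted sum of all rewards from the current step onward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def one_step_target(reward, next_value_estimate, gamma):
    """Bootstrapped target: immediate reward plus discounted critic estimate."""
    return reward + gamma * next_value_estimate


gamma = 0.9
rewards = [0.0, 0.0, 1.0]                      # hypothetical episode

mc = monte_carlo_return(rewards, gamma)        # unbiased target for the first state
true_next_value = monte_carlo_return(rewards[1:], gamma)
critic_guess = 0.5                             # an imperfect critic estimate of V(s')

exact = one_step_target(rewards[0], true_next_value, gamma)
biased = one_step_target(rewards[0], critic_guess, gamma)
bias = biased - mc                             # error inherited from the critic
```

With a perfect critic the bootstrapped target matches the Monte Carlo return; with the wrong guess it is off by gamma times the critic's error, and that offset is the bias the actor inherits.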

Key idea

Actor-critic pairs a policy actor with a value critic, using the critic's advantage estimate to give the actor low-variance, bootstrapped policy-gradient updates.

Check yourself

1. What are the roles of the actor and the critic?

2. How does the critic help the actor compared to plain REINFORCE?

3. What tradeoff does the critic introduce?