The Actor Critic Architecture

Actor critic methods combine the strengths of policy gradients and value learning. An actor chooses actions while a critic evaluates them, and the two train together.

Two networks one loop

The actor is the policy. It outputs actions and is updated by a policy gradient.
The critic estimates a value function. It tells the actor how good its actions were.

The critic's estimate replaces the noisy full return used in plain policy gradients, dramatically lowering variance while keeping the update informative.

The advantage signal

The critic typically supplies an advantage, the difference between the action's value and the state's baseline value. A positive advantage pushes the actor to make that action more likely; a negative one pushes it down. This is the same idea as a baseline, computed online.

Why it works well

The critic enables bootstrapped updates, so the actor can learn step by step rather than waiting for full episodes.
Variance drops compared to REINFORCE, speeding learning.
It scales to continuous control and large networks.

Caveats

The critic introduces bias because its estimates are imperfect, so actor critic trades some bias for much less variance. Methods like A2C and A3C, and later PPO, build on this foundation.

Key idea

Actor critic pairs a policy actor with a value critic, using the critic's advantage estimate to give the actor low variance, bootstrapped policy gradient updates.

The Actor Critic Architecture