The Soft Actor-Critic Algorithm

Maximizing reward and entropy together for sample-efficient, robust off-policy control.

The maximum entropy objective

Soft Actor-Critic (SAC) optimizes an objective that adds an entropy bonus to the reward. The agent seeks high return while keeping its policy as random as possible. This encourages broad exploration, discourages premature collapse onto a single strategy, and tends to produce behavior that is robust to perturbations.
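
Written out, the objective can be stated as follows. The notation is an assumption introduced here for illustration: \alpha is the temperature that weights entropy against reward, \rho_\pi is the distribution of state-action pairs visited under the policy \pi, and \mathcal{H} is the entropy of the action distribution at a state.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big]
```

Setting \alpha = 0 recovers the usual expected-return objective; larger \alpha puts more weight on keeping the policy random.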

Off-policy and stable

SAC is off-policy: it learns from a replay buffer of past transitions, which gives it strong sample efficiency. It also combines several stabilizing ideas:

Two critics, taking the minimum of their estimates to counter overestimation of action values.
A stochastic actor trained toward the action distribution implied by the soft action values.
A temperature parameter that weights entropy against reward, often tuned automatically to hit a target entropy.
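
The twin-critic target and the automatic temperature adjustment can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a full SAC implementation: the function names `soft_td_target` and `temperature_step` are hypothetical, the critic outputs and log-probabilities are passed in as plain arrays, and the temperature is parameterized as `log_alpha` so that it stays positive.

```python
import numpy as np

def soft_td_target(reward, done, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Entropy-regularized TD target built from two critic estimates (sketch)."""
    # Take the smaller of the two estimates to counter overestimation.
    min_q_next = np.minimum(q1_next, q2_next)
    # Entropy bonus: -alpha * log pi(a'|s') for the sampled next action.
    soft_v_next = min_q_next - alpha * logp_next
    # No bootstrapping past terminal transitions (done == 1).
    return reward + gamma * (1.0 - done) * soft_v_next

def temperature_step(log_alpha, logp, target_entropy, lr=3e-4):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + target_entropy)]."""
    # Gradient with respect to log_alpha; -mean(logp) estimates the policy entropy.
    grad = -np.exp(log_alpha) * (np.mean(logp) + target_entropy)
    return log_alpha - lr * grad
```

When the estimated entropy falls below the target, the step raises the temperature so that entropy counts for more; when it is above the target, the temperature is lowered.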

Soft value functions

The value and action-value functions become soft: their targets include the entropy term. The actor is updated by minimizing the KL divergence between the policy and the normalized exponential of the soft action values (scaled by the temperature), which pushes it toward high-value, high-entropy actions. Strong results on continuous-control benchmarks made SAC a default choice for robotics-style tasks.
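
A toy example makes the KL projection concrete. The sketch below assumes a single state with a small discrete action set, which is a simplification chosen for illustration (SAC itself targets continuous actions and uses a parameterized actor); the function names are hypothetical. In this setting the policy that minimizes the KL divergence is exactly the softmax of Q / alpha, and the resulting soft value is alpha times the log-sum-exp of Q / alpha.

```python
import numpy as np

def soft_value(q, alpha):
    """Soft state value alpha * log(sum(exp(q / alpha))), computed stably."""
    z = q / alpha
    m = np.max(z)
    return float(alpha * (m + np.log(np.sum(np.exp(z - m)))))

def boltzmann_policy(q, alpha):
    """Normalized exponential of the action values: the KL-minimizing policy."""
    z = q / alpha
    p = np.exp(z - np.max(z))
    return p / p.sum()

def actor_objective(pi, q, alpha):
    """E_pi[alpha * log pi(a) - Q(a)], the quantity the actor minimizes."""
    return float(np.sum(pi * (alpha * np.log(pi) - q)))

q = np.array([1.0, 2.0, 0.5])   # illustrative action values for one state
alpha = 0.5                     # illustrative temperature
pi_star = boltzmann_policy(q, alpha)
uniform = np.full_like(q, 1.0 / q.size)

best = actor_objective(pi_star, q, alpha)
worse = actor_objective(uniform, q, alpha)
```

Here `best` equals the negative soft value of the state, and any other policy, such as `uniform`, scores higher (worse). Lowering `alpha` sharpens the policy toward the highest-value action, while raising it flattens the policy toward uniform.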

Key idea

SAC maximizes reward plus policy entropy off-policy, using twin critics, a stochastic actor, and an auto-tuned temperature to deliver sample-efficient, stable, and exploratory continuous control.
