The maximum entropy objective
Soft Actor-Critic (SAC) optimizes an objective that adds an entropy bonus to the reward. The agent seeks high return while keeping its policy as random as the task allows. This encourages broad exploration, avoids premature collapse onto a single strategy, and tends to yield behavior that is robust to perturbations.
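Written out, the objective described above takes roughly the following form. The symbols are notation assumed here for illustration, since the text does not define them: r is the reward, \alpha is the temperature weighting entropy against reward, \mathcal{H} is the entropy of the policy at a state, and \rho_\pi is the state-action distribution induced by the policy \pi.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ \log \pi(a \mid s) \right]
```

Setting \alpha = 0 recovers the standard expected-return objective, so the temperature controls how far the agent departs from pure reward maximization.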
Off-policy and stable
SAC is off-policy: it learns from a replay buffer of past transitions, which improves sample efficiency. It also combines several stabilizing ideas:
- Two critics, taking the minimum of their estimates to fight overestimation.
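- A stochastic actor trained toward the distribution induced by the soft action values, so that higher-valued actions receive more probability without the policy becoming deterministic.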
- A temperature parameter that weights entropy against reward, often tuned automatically to hit a target entropy.
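The twin-critic minimum and the temperature can be sketched for a single transition as below. This is a minimal illustration in plain Python, not a reference implementation: the function names, argument names, and default values (alpha=0.2, gamma=0.99) are assumptions chosen for the example, and a real agent would compute these quantities over batches with neural networks.

```python
import math


def soft_td_target(reward, done, q1_next, q2_next, logp_next,
                   alpha=0.2, gamma=0.99):
    """Entropy-regularized TD target for one transition (hypothetical helper).

    q1_next, q2_next: the two critics' estimates at the next state-action.
    logp_next: log-probability of the next action under the current policy.
    """
    # Taking the smaller of the two critic estimates counters overestimation.
    min_q = min(q1_next, q2_next)
    # Subtracting alpha * log pi adds the entropy bonus to the next-state value.
    soft_value_next = min_q - alpha * logp_next
    # No bootstrapping from terminal states (done == 1.0).
    return reward + gamma * (1.0 - done) * soft_value_next


def temperature_loss(log_alpha, logp_batch, target_entropy):
    """Loss whose minimization adjusts alpha toward a target entropy.

    Assumed form: mean of -alpha * (log pi + target_entropy). Alpha is
    parameterized through its logarithm so that it stays positive.
    """
    alpha = math.exp(log_alpha)
    terms = [-alpha * (lp + target_entropy) for lp in logp_batch]
    return sum(terms) / len(terms)
```

When the policy's entropy falls below the target, the average of log pi plus the target is positive, so lowering this loss raises alpha and strengthens the entropy bonus; the reverse holds when the policy is more random than the target.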
Soft value functions
The value and action-value functions become soft: their targets include the entropy term. The actor is updated by minimizing the KL divergence between the policy and the normalized exponential of the soft action values scaled by the temperature, which pushes it toward high-value, high-entropy actions. Strong results on continuous-control benchmarks made SAC a default choice for robotics-style tasks.
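The actor update can be illustrated on a toy problem with a handful of discrete actions, where the target distribution can be computed exactly. The sketch below is an assumption-laden illustration rather than how SAC is run in continuous control: the helper names are hypothetical, and the action values are made-up numbers. It shows that the loss E[alpha * log pi(a) - Q(a)], which differs from alpha times the KL divergence only by a constant, is minimized by the policy proportional to exp(Q / alpha), and that the minimum equals the negative soft value.

```python
import math


def boltzmann_policy(q_values, alpha):
    """Policy proportional to exp(Q / alpha), computed stably."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def actor_loss(policy, q_values, alpha):
    """Expected alpha * log pi(a) - Q(a) under the policy.

    Zero-probability actions contribute nothing (0 * log 0 is taken as 0).
    """
    return sum(p * (alpha * math.log(p) - q)
               for p, q in zip(policy, q_values) if p > 0.0)


def soft_value(q_values, alpha):
    """Soft state value: alpha * log sum_a exp(Q(a) / alpha)."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)
    return alpha * (m + math.log(sum(math.exp(s - m) for s in scaled)))


# Made-up action values for a single state, for illustration only.
q = [1.0, 2.0, 0.5]
alpha = 0.5

pi_star = boltzmann_policy(q, alpha)
uniform = [1.0 / len(q)] * len(q)
greedy = [0.0, 1.0, 0.0]

loss_star = actor_loss(pi_star, q, alpha)
loss_uniform = actor_loss(uniform, q, alpha)
loss_greedy = actor_loss(greedy, q, alpha)
```

Here loss_star is lower than both loss_uniform and loss_greedy: the uniform policy has maximal entropy but ignores the values, while the greedy policy picks the best action but earns no entropy bonus.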
Key idea
SAC maximizes reward plus policy entropy in an off-policy setting, using twin critics, a stochastic actor, and an automatically tuned temperature to deliver sample-efficient, stable, and exploratory continuous control.