Machine Learning

The Soft Actor-Critic Algorithm

Maximizing reward and entropy together for sample-efficient, robust off-policy control.

7 min read · advanced

The maximum entropy objective

Soft Actor-Critic (SAC) optimizes a maximum entropy objective: the expected return plus an entropy bonus on the policy, weighted by a temperature. The agent seeks high return while keeping its policy as random as possible. This encourages broad exploration, avoids premature collapse onto one strategy, and yields robust behavior.
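To make the objective concrete, here is a minimal sketch of an entropy-augmented return for a short trajectory. It assumes discrete action distributions so the entropy can be computed exactly (SAC itself usually targets continuous actions), and the names entropy, soft_return, alpha, and gamma are illustrative choices, not taken from the lesson.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(nonzero * np.log(nonzero)))

def soft_return(rewards, policies, alpha=0.2, gamma=0.99):
    """Discounted sum of (reward + alpha * policy entropy) over a trajectory.

    rewards:  one reward per step
    policies: one action distribution per step (assumed discrete here)
    alpha:    temperature weighting entropy against reward (assumed value)
    """
    total = 0.0
    for t, (r, p) in enumerate(zip(rewards, policies)):
        total += gamma ** t * (r + alpha * entropy(p))
    return total
```

With identical rewards, a trajectory generated by a uniform policy scores higher under this objective than one generated by a deterministic policy, which is the pressure toward randomness described above.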

Off policy and stable

SAC is off-policy: it learns from a replay buffer of past transitions, which gives it strong sample efficiency. It combines several stabilizing ideas:

  • Two critics, taking the minimum of their estimates to fight overestimation.
  • A stochastic actor trained to match the soft value landscape.
  • A temperature parameter that weights entropy against reward, often tuned automatically to hit a target entropy.
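The twin-critic minimum and the temperature can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not a full training loop: the function names, argument names, and default values of alpha and gamma are hypothetical, and the critic outputs and log-probabilities are passed in as plain numbers rather than produced by networks.

```python
import numpy as np

def soft_td_target(reward, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """Critic target: reward plus the discounted soft value of the next state."""
    reward = np.asarray(reward, dtype=float)
    done = np.asarray(done, dtype=float)
    # Pessimistic estimate: elementwise minimum of the two critics.
    min_q = np.minimum(np.asarray(q1_next, dtype=float),
                       np.asarray(q2_next, dtype=float))
    # Entropy bonus: -alpha * log pi(a'|s') for the sampled next action.
    soft_value = min_q - alpha * np.asarray(logp_next, dtype=float)
    # No bootstrapping past terminal transitions (done == 1).
    return reward + gamma * (1.0 - done) * soft_value

def temperature_loss(alpha, logp, target_entropy):
    """Loss for automatic temperature tuning.

    Minimizing it with respect to alpha raises alpha when the policy's
    entropy (the mean of -logp) is below target_entropy, and lowers it
    when the entropy is above the target.
    """
    logp = np.asarray(logp, dtype=float)
    return float(np.mean(-alpha * (logp + target_entropy)))
```

Taking the minimum means a single over-optimistic critic cannot inflate the target on its own. In practice many implementations optimize the logarithm of the temperature to keep it positive; that detail is left out here.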

Soft value functions

The value and action-value functions become soft: their targets include the entropy term. The actor is updated by minimizing the KL divergence between the policy and the distribution proportional to the exponential of the soft action values divided by the temperature, which pushes it toward high-value, high-entropy actions. Strong results on continuous control benchmarks made SAC a default choice for robotics-style tasks.
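The actor update can be illustrated with a small discrete example. This is a sketch under simplifying assumptions: a single state with a handful of discrete actions and fixed soft action values, whereas SAC in practice uses a parameterized continuous policy and sampled actions. The helper names boltzmann, kl, and actor_loss are illustrative.

```python
import numpy as np

def boltzmann(q_values, alpha):
    """Target distribution proportional to exp(Q / alpha)."""
    z = np.asarray(q_values, dtype=float) / alpha
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def actor_loss(pi, q_values, alpha):
    """Expected value of (alpha * log pi - Q) under the policy pi.

    This equals alpha * KL(pi || boltzmann(Q, alpha)) minus a constant
    that does not depend on pi, so both have the same minimizer.
    """
    pi = np.asarray(pi, dtype=float)
    q = np.asarray(q_values, dtype=float)
    mask = pi > 0
    return float(np.sum(pi[mask] * (alpha * np.log(pi[mask]) - q[mask])))
```

For example, with action values [1.0, 2.0, 3.0] the policy boltzmann([1.0, 2.0, 3.0], alpha) has zero KL to the target and a lower actor_loss than a uniform policy. A smaller alpha concentrates the target on the best action, while a larger alpha spreads it out.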

Key idea

SAC maximizes reward plus policy entropy off-policy, using twin critics, a stochastic actor, and an automatically tuned temperature to deliver sample-efficient, stable, and exploratory continuous control.

Check yourself

1. What does SAC add to the standard reward objective?

2. Why does SAC use two critics and take their minimum?

3. What does the temperature parameter control in SAC?