Synchronous SGD

Stepping together

Synchronous SGD keeps all workers in lockstep. Each worker computes a gradient on its own data shard, then waits at a barrier until the gradients from all workers have been combined. No worker updates its weights before that.

All workers use the same weight version each step.
A collective such as all-reduce averages the gradients across workers.
Every worker applies the identical update.
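
The three points above can be sketched as a toy, single-process simulation. This is a minimal illustration under stated assumptions, not a distributed implementation: the all-reduce is simulated with a plain Python function, the model is a one-parameter least-squares fit, and all names (shard_gradient, all_reduce_mean, sync_sgd_step) are hypothetical.

```python
# Toy synchronous SGD on a 1-D least-squares problem, y ~ w * x.
# All names are illustrative; the all-reduce is simulated in-process.

def shard_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_sgd_step(weights, shards, lr):
    """One lockstep step: local gradients, combine, identical update."""
    # 1. Every worker computes a gradient from its own replica of the weights.
    local_grads = [shard_gradient(w, s) for w, s in zip(weights, shards)]
    # 2. Barrier plus all-reduce: nothing proceeds until all gradients are in.
    averaged = all_reduce_mean(local_grads)
    # 3. Every worker applies the identical update.
    return [w - lr * g for w, g in zip(weights, averaged)]

data = [(float(x), 3.0 * x) for x in range(1, 9)]      # true slope is 3
shards = [data[0:2], data[2:4], data[4:6], data[6:8]]  # 4 workers, equal shards
weights = [0.0] * 4                                    # identical replicas

for _ in range(200):
    weights = sync_sgd_step(weights, shards, lr=0.01)
```

Because the shards here have equal size, the averaged gradient matches the gradient over the full data set, so the replicas stay identical and follow the same trajectory a single worker would on the combined batch.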

Clean but barrier-bound

Because every gradient in a step is computed from the same weights, synchronous SGD behaves like single-worker SGD on one large batch made up of all the workers' mini-batches, which makes it easier to reason about and reproduce. The price is the straggler problem: the slowest worker sets the pace for the whole step.

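It avoids the staleness of asynchronous training, because no gradient is computed from outdated weights.
Step time is limited by the slowest worker.
Backup workers can mitigate stragglers: run a few extra workers and use only the first gradients to arrive each step.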
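
To make the straggler cost and the backup-worker idea concrete, here is a small timing simulation. It is a sketch under assumptions that are not from the text: per-worker compute times are drawn from a made-up distribution (about 1 second, with a 10% chance of a 5x slowdown), and the helper names are hypothetical.

```python
import random

def step_time_sync(compute_times):
    """With a full barrier, the step finishes when the slowest worker does."""
    return max(compute_times)

def step_time_with_backups(compute_times, needed):
    """Use only the first `needed` gradients to arrive and ignore the rest."""
    return sorted(compute_times)[needed - 1]

rng = random.Random(0)
needed, backups = 8, 2

def sample_times(n):
    # Assumed model: ~1.0 s per step, with a 10% chance of a 5x slowdown.
    return [rng.uniform(0.9, 1.1) * (5.0 if rng.random() < 0.1 else 1.0)
            for _ in range(n)]

plain, backed = [], []
for _ in range(1000):
    times = sample_times(needed + backups)
    # Plain synchronous step: wait for all 8 regular workers.
    plain.append(step_time_sync(times[:needed]))
    # With 2 backups: proceed once any 8 of the 10 workers have reported.
    backed.append(step_time_with_backups(times, needed))

mean_plain = sum(plain) / len(plain)
mean_backed = sum(backed) / len(backed)
```

In this model the step with backups is never slower than the plain step, since the eighth-fastest of ten workers always finishes no later than the slowest of any eight. The trade-off is that the gradients from the slowest workers are discarded, so some computation is wasted.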

A shared barrier

The barrier guarantees consistency at the cost of waiting for the slowest participant.
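
The barrier itself can be illustrated with threads in one process. This is only an analogy for the distributed case, assuming Python's standard threading.Barrier as a stand-in for a cluster-wide barrier; the sleep times and placeholder "gradients" are invented for the example.

```python
import threading
import time

NUM_WORKERS = 4
local_grads = [0.0] * NUM_WORKERS
averaged = [0.0]                     # written by the barrier action each step
arrived_at = [0.0] * NUM_WORKERS
released_at = [0.0] * NUM_WORKERS

def combine():
    # Runs in exactly one thread after all workers arrive, before any is released.
    averaged[0] = sum(local_grads) / NUM_WORKERS

barrier = threading.Barrier(NUM_WORKERS, action=combine)

def worker(rank):
    time.sleep(0.02 * rank)          # pretend higher ranks are slower
    local_grads[rank] = float(rank)  # placeholder "gradient"
    arrived_at[rank] = time.monotonic()
    barrier.wait()                   # block until every worker has arrived
    released_at[rank] = time.monotonic()

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No worker is released before the last one arrives, so every worker sees the same averaged value, and the fastest worker spends the difference idle.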

Key idea

Synchronous SGD averages gradients at a barrier so that all workers apply the same update. This gives clean, reproducible training but exposes every step to the straggler problem.
