Stepping together
Synchronous SGD keeps all workers in lockstep. Every worker computes a gradient on its data shard, then they wait at a barrier and combine gradients before any of them updates.
- All workers use the same weight version each step.
- A collective such as all-reduce averages the gradients across workers.
- Every worker applies the identical update.
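The step described above can be sketched in plain Python. This is a minimal single-process simulation, not a distributed implementation: the function names (`local_gradient`, `all_reduce_mean`, `sync_sgd_step`), the toy model y ≈ w * x, the data, and the learning rate are all assumptions made for illustration.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y ≈ w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def sync_sgd_step(weights, shards, lr):
    """One synchronous step: local gradients, averaging, identical update."""
    # Every worker computes its gradient against the same weight version.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    # The averaging needs every gradient, so it acts as the barrier.
    averaged = all_reduce_mean(grads)
    # Every replica applies the same averaged gradient.
    return [w - lr * g for w, g in zip(weights, averaged)]


# Hypothetical data: two workers, equal-sized shards drawn from y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0, 0.0]  # one replica of the weight per worker

# With equal shard sizes, the averaged gradient matches the gradient
# computed over all the data at once.
first_avg = all_reduce_mean([local_gradient(0.0, s) for s in shards])[0]
full_batch = local_gradient(0.0, shards[0] + shards[1])

for _ in range(50):
    weights = sync_sgd_step(weights, shards, lr=0.05)
```

After the loop the two replicas hold the same weight, which has moved close to the true slope of 2. In a real system the averaging would be performed by a collective communication library rather than a Python function.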
Clean but barrier bound
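Because every gradient is computed against the same weights and then averaged, one synchronous step is equivalent to a single large-batch step over the combined per-worker batches (assuming equal-sized batches). That makes training easier to reason about and reproduce. The price is the straggler problem: the slowest worker sets the pace for the whole step.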
- It avoids the staleness of asynchronous training.
- Each step takes at least as long as its slowest worker.
- Backup workers (extra replicas, with the slowest results dropped each step) can mitigate stragglers.
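A small calculation may make the straggler cost concrete. The sketch below assumes one common formulation of backup workers, in which extra workers are launched and the step proceeds once the required number of gradients has arrived. The helper `step_time` and the per-worker timings are hypothetical, and communication time is ignored.

```python
def step_time(worker_times, needed=None):
    """Time until `needed` workers have finished (all of them by default)."""
    needed = len(worker_times) if needed is None else needed
    return sorted(worker_times)[needed - 1]


# Hypothetical compute times in seconds for one step; the last worker straggles.
times = [1.0, 1.1, 0.9, 4.0]

# Plain barrier: the step waits for every worker, so the straggler sets the pace.
full_barrier = step_time(times)

# One backup worker (assumed to take 1.2 s): still only 4 gradients are needed,
# so the step finishes when the 4th fastest of the 5 workers is done.
with_backup = step_time(times + [1.2], needed=4)
```

Here `full_barrier` is 4.0 and `with_backup` is 1.2. The trade-off is that the straggler's gradient is computed and then discarded.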
A shared barrier
The barrier guarantees consistency at the cost of waiting for the slowest participant.
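The barrier itself can be illustrated with threads standing in for workers. This is a sketch using Python's standard `threading.Barrier`; the worker count, the placeholder gradients, and the learning rate of 0.1 are assumptions, and a real cluster would use a collective across machines instead.

```python
import threading

NUM_WORKERS = 3
STEPS = 2
barrier = threading.Barrier(NUM_WORKERS)
grads = [0.0] * NUM_WORKERS           # one gradient slot per worker
final_weights = [None] * NUM_WORKERS  # filled in when each worker finishes


def worker(rank):
    w = 0.0
    for _ in range(STEPS):
        grads[rank] = float(rank + 1)  # placeholder for a real gradient
        barrier.wait()                 # no one proceeds until all gradients are written
        mean = sum(grads) / NUM_WORKERS
        barrier.wait()                 # no one overwrites grads until all have read them
        w -= 0.1 * mean                # identical update on every worker
    final_weights[rank] = w


threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every entry of `final_weights` ends up identical because each worker reads the same complete set of gradients. The second `barrier.wait()` keeps a fast worker from writing its next gradient while a slower one is still reading the current set.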
Key idea
Synchronous SGD averages gradients at a barrier so all workers apply the same update, giving clean, reproducible training but exposing the straggler problem.