No waiting for stragglers
In asynchronous SGD, workers do not wait for each other. Each worker pulls the current weights, computes a gradient, and applies its update whenever it is ready, often through a parameter server.
- Fast workers are never blocked by slow ones.
- Updates arrive in an unpredictable order.
- Throughput is high because hardware is rarely idle.
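The pull, compute, push loop described above can be sketched with threads standing in for workers. This is a minimal illustration, not a production design: the `ParameterServer` class, the scalar weight, the toy loss (w - 3)^2, and the sleep-based compute delays are all assumptions made for the example.

```python
import threading
import time


class ParameterServer:
    """Toy parameter server holding one scalar weight (hypothetical, for illustration)."""

    def __init__(self, w0, lr):
        self.w = w0
        self.lr = lr
        self.version = 0  # number of updates applied so far
        self._lock = threading.Lock()

    def pull(self):
        # Return the current weight and its version.
        with self._lock:
            return self.w, self.version

    def push(self, grad):
        # Apply the update immediately; there is no barrier across workers.
        with self._lock:
            self.w -= self.lr * grad
            self.version += 1


def worker(server, steps, delay):
    for _ in range(steps):
        w, _ = server.pull()      # weights may change while this worker computes
        time.sleep(delay)         # stand-in for gradient computation time
        grad = 2.0 * (w - 3.0)    # gradient of the toy loss (w - 3)^2 at the pulled weight
        server.push(grad)         # pushed whenever ready, in no fixed order


server = ParameterServer(w0=0.0, lr=0.05)
delays = [0.0, 0.001, 0.003]      # assumed uneven worker speeds
threads = [threading.Thread(target=worker, args=(server, 50, d)) for d in delays]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_w = server.w
print(f"updates applied: {server.version}, final weight: {final_w:.4f}")
```

The lock only protects each individual read or write. Nothing forces the workers to take turns, so the fast worker finishes its steps without waiting for the slow ones.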
The cost of staleness
The catch is stale gradients. By the time a worker pushes its update, the weights may have moved on, so its gradient was computed against an old version. Mild staleness is tolerable, but heavy staleness can slow or destabilize convergence.
- Staleness tends to grow with the number of workers, since more updates land between a worker's pull and its push.
- Bounded staleness limits how far behind a worker may be.
- Asynchronous SGD suits clusters with uneven worker speeds.
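Staleness can be made concrete by counting how many updates were applied between a worker's pull and its push. The sketch below is a deterministic, event-driven simulation rather than a real cluster. The `simulate` function, the toy loss (w - 3)^2, and the compute times are assumptions for illustration. The bound is enforced here by discarding gradients that are too stale, which is only one possible policy; other schemes make fast workers wait instead.

```python
import heapq


def simulate(compute_times, total_updates, max_staleness=None, lr=0.05):
    """Simulate asynchronous SGD on the toy loss (w - 3)^2.

    compute_times[i] is the simulated time worker i needs per gradient.
    Returns (final weight, staleness of each applied update, number of dropped gradients).
    """
    w, version = 0.0, 0
    applied, dropped = [], 0
    # Each event: (finish_time, worker_id, weight_pulled, version_pulled)
    events = [(t, i, w, version) for i, t in enumerate(compute_times)]
    heapq.heapify(events)
    while len(applied) < total_updates:
        now, i, w_old, v_old = heapq.heappop(events)
        staleness = version - v_old  # updates applied since this worker pulled
        if max_staleness is not None and staleness > max_staleness:
            dropped += 1             # assumed policy: discard gradients that are too stale
        else:
            w -= lr * 2.0 * (w_old - 3.0)  # gradient evaluated at the old weights
            version += 1
            applied.append(staleness)
        # The worker pulls the current weights and starts its next gradient.
        heapq.heappush(events, (now + compute_times[i], i, w, version))
    return w, applied, dropped


speeds = [1.0, 1.0, 1.0, 10.0]  # three fast workers and one straggler (assumed)
w_free, stale_free, _ = simulate(speeds, total_updates=60)
w_bound, stale_bound, n_dropped = simulate(speeds, total_updates=60, max_staleness=4)

print("max staleness without a bound:", max(stale_free))
print("max staleness with bound 4:", max(stale_bound), "dropped:", n_dropped)
```

In this setup the straggler's gradient arrives many versions late when no bound is set, while the bounded run never applies a gradient more than four versions old. The trade-off is that the straggler's work is thrown away under this discard policy.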
Independent updates
Removing the synchronization barrier buys speed but admits gradients from older weight versions.
Key idea
Asynchronous SGD lets workers update whenever ready without a barrier, gaining throughput at the cost of stale gradients that can hurt convergence.