Asynchronous SGD

No waiting for stragglers

In asynchronous SGD, workers do not wait for each other. Each worker pulls the current weights, computes a gradient, and applies its update whenever it is ready, often through a parameter server.

Fast workers are never blocked by slow ones.
Updates arrive in an unpredictable order.
Throughput is high because hardware is rarely idle.
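
The pull, compute, push loop described above can be sketched with threads standing in for workers. This is a minimal illustration, not a production design: the ParameterServer class, the worker and run_async_sgd functions, the toy linear-regression data, and the learning rate are all assumptions made for the example.

```python
import random
import threading


class ParameterServer:
    """Hypothetical in-process parameter server holding one weight vector."""

    def __init__(self, dim, lr):
        self.weights = [0.0] * dim
        self.lr = lr
        self.version = 0  # number of updates applied so far
        self._lock = threading.Lock()

    def pull(self):
        # Return a copy of the current weights and their version.
        with self._lock:
            return list(self.weights), self.version

    def push(self, grad):
        # Apply the update immediately; no barrier, no waiting for other workers.
        with self._lock:
            self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]
            self.version += 1


def gradient(weights, x, y):
    # Gradient of 0.5 * (w . x - y)^2 with respect to w.
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [err * xi for xi in x]


def worker(server, data, steps, seed):
    rng = random.Random(seed)
    for _ in range(steps):
        weights, _ = server.pull()            # may already be outdated at push time
        x, y = rng.choice(data)
        server.push(gradient(weights, x, y))  # applied whenever this worker is ready


def run_async_sgd(num_workers=4, steps=2000, lr=0.05):
    # Toy noiseless data generated from assumed true weights [2.0, -1.0].
    true_w = [2.0, -1.0]
    rng = random.Random(0)
    data = []
    for _ in range(200):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, sum(w * xi for w, xi in zip(true_w, x))))

    server = ParameterServer(dim=2, lr=lr)
    threads = [
        threading.Thread(target=worker, args=(server, data, steps, i))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.weights, server.version
```

Calling run_async_sgd() returns the final weights and the number of updates applied. The order in which the workers' updates land varies from run to run, which is the unpredictability noted above.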

The cost of staleness

The catch is stale gradients. By the time a worker pushes its update, the weights may have moved on, so its gradient was computed against an old version. Mild staleness is tolerable, but heavy staleness can slow or destabilize convergence.
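
One common way to write the stale update is shown here. The notation is an assumption introduced for illustration: w_t is the weight vector after t updates, \eta is the learning rate, f is the loss, and \tau_t is the staleness, meaning the number of updates applied between the worker's pull and its push.

```latex
w_{t+1} = w_t - \eta \, \nabla f\!\left(w_{t-\tau_t}\right), \qquad \tau_t \ge 0
```

With \tau_t = 0 this reduces to an ordinary SGD step. The larger \tau_t is, the older the weights at which the gradient was evaluated.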

Staleness tends to grow with the number of workers, because more updates can land between one worker's pull and its push.
Bounded staleness limits how far behind a worker may be.
Asynchronous SGD suits clusters with uneven worker speeds.
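
The effect of staleness, and of bounding it, can be seen in a small deterministic simulation. This is a sketch under strong assumptions: a one-dimensional loss f(w) = 0.5 * w^2, a fixed delay instead of a varying one, and a learning rate of 0.5 chosen so the contrast is visible. The names delayed_gd and clip_staleness are hypothetical. Clipping the delay is a simplification, since real bounded-staleness schemes usually enforce the bound by making fast workers wait.

```python
def delayed_gd(staleness, lr=0.5, steps=200, w0=1.0):
    """Minimize f(w) = 0.5 * w**2 with gradients that are `staleness` steps old.

    Returns the largest |w| over the last 20 iterates, so an oscillating
    run is not mistaken for a converged one.
    """
    history = [w0]
    for t in range(steps):
        stale_w = history[max(0, t - staleness)]  # weights the worker pulled
        grad = stale_w                            # f'(w) = w at the old weights
        history.append(history[-1] - lr * grad)   # applied to the current weights
    return max(abs(w) for w in history[-20:])


def clip_staleness(staleness, bound):
    # Simplified bounded staleness: no gradient is older than `bound` steps.
    return min(staleness, bound)


fresh = delayed_gd(staleness=0)                       # no delay: converges
stale = delayed_gd(staleness=4)                       # heavy delay: oscillates and grows
bounded = delayed_gd(staleness=clip_staleness(4, 1))  # delay capped at 1: converges
```

With these settings the run with a delay of 4 diverges, while the same run with the delay capped at 1 converges. The threshold depends on the learning rate, so a smaller step size tolerates more staleness.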

Independent updates

Removing the synchronization barrier buys speed but admits gradients from older weight versions.

Key idea

Asynchronous SGD lets workers update whenever ready without a barrier, gaining throughput at the cost of stale gradients that can hurt convergence.
