Machine Learning

Asynchronous SGD

Let workers update without waiting, accepting staleness for throughput.

No waiting for stragglers

In asynchronous SGD, workers do not wait for each other. Each worker pulls the current weights, computes a gradient, and applies its update whenever it is ready, often through a parameter server.

  • Fast workers are never blocked by slow ones.
  • Updates arrive in an unpredictable order.
  • Throughput is high because hardware is rarely idle.
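The pull, compute, push loop can be sketched with threads standing in for workers. This is a minimal illustration, not a real distributed setup: `ParameterServer`, `worker`, `run_async`, the toy loss (w - 3)^2, the learning rate, and the sleep delays are all assumptions made for the example.

```python
import threading
import time


class ParameterServer:
    """Hypothetical in-process stand-in for a parameter server (one scalar weight)."""

    def __init__(self, w0):
        self._w = w0
        self._lock = threading.Lock()
        self.version = 0  # counts how many updates have been applied

    def pull(self):
        # Return the current weight and its version number.
        with self._lock:
            return self._w, self.version

    def push(self, grad, lr):
        # Apply the update immediately; there is no barrier across workers.
        with self._lock:
            self._w -= lr * grad
            self.version += 1


def worker(ps, steps, delay, lr=0.02):
    for _ in range(steps):
        w, _ = ps.pull()          # weights may change while we compute
        time.sleep(delay)         # simulated compute time (assumed values)
        grad = 2.0 * (w - 3.0)    # gradient of the toy loss (w - 3)^2
        ps.push(grad, lr)         # push whenever ready, possibly stale


def run_async(delays=(0.001, 0.002, 0.004), steps=100):
    ps = ParameterServer(w0=0.0)
    threads = [threading.Thread(target=worker, args=(ps, steps, d)) for d in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ps.pull()


if __name__ == "__main__":
    w, version = run_async()
    print(f"final w = {w:.4f} after {version} updates")
```

Each worker finishes its 100 steps at its own pace, so the server applies 300 updates in total without any worker waiting on another. The lock only protects the single read or write; it is not a synchronization barrier.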

The cost of staleness

The catch is stale gradients. By the time a worker pushes its update, the weights may have moved on, so its gradient was computed against an old version. Mild staleness is tolerable, but heavy staleness can slow or destabilize convergence.

  • Staleness tends to grow with the number of workers, because more updates land between a worker's pull and its push.
  • Bounded staleness limits how far behind a worker may be.
  • Asynchronous SGD suits clusters with uneven worker speeds, where a barrier would leave fast workers idle.
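The effect of staleness can be seen in a small deterministic simulation. The sketch below is an assumption-laden toy: `delayed_sgd` and `tail_error` are hypothetical helpers, the loss is (w - 3)^2, every gradient is computed on weights a fixed number of steps old, and bounded staleness is modeled crudely by capping that delay. Real bounded-staleness schemes typically make a worker wait or refresh its weights instead.

```python
def delayed_sgd(staleness, max_staleness=None, lr=0.2, steps=200, w0=0.0, target=3.0):
    """Minimize (w - target)^2 using gradients from weights `staleness` steps old."""
    # Simplified model of bounded staleness: cap the delay at max_staleness.
    delay = staleness if max_staleness is None else min(staleness, max_staleness)
    history = [w0]  # history[t] is the weight after t updates
    for t in range(steps):
        stale_w = history[max(0, t - delay)]      # old weight version seen by the worker
        grad = 2.0 * (stale_w - target)           # gradient at the stale weights
        history.append(history[-1] - lr * grad)   # applied to the current weights
    return history


def tail_error(history, target=3.0, tail=40):
    """Largest distance from the target over the last `tail` steps."""
    return max(abs(w - target) for w in history[-tail:])


if __name__ == "__main__":
    for s in (0, 1, 10):
        print(f"staleness={s:2d}  tail error={tail_error(delayed_sgd(s)):.3e}")
    bounded = delayed_sgd(10, max_staleness=1)
    print(f"staleness=10 capped at 1  tail error={tail_error(bounded):.3e}")
```

Under these assumed settings, a delay of 0 or 1 settles on the target, while a delay of 10 oscillates with growing amplitude. Capping the delay at 1 restores convergence, which is the intuition behind bounding staleness.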

Independent updates

Removing the synchronization barrier buys speed but admits gradients from older weight versions.
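One common way to write the difference is shown below. The notation is an assumption and does not come from the lesson: η is the learning rate, K the number of workers, g_k the gradient computed by worker k, k_t the worker whose update is applied at step t, and τ_t the staleness of that update.

```latex
\begin{align*}
\text{synchronous:}\quad  & w_{t+1} = w_t - \eta \, \frac{1}{K} \sum_{k=1}^{K} g_k(w_t) \\
\text{asynchronous:}\quad & w_{t+1} = w_t - \eta \, g_{k_t}\!\left(w_{t-\tau_t}\right), \qquad \tau_t \ge 0
\end{align*}
```

With τ_t = 0 the asynchronous update uses a fresh gradient; larger τ_t means the gradient was computed on weights that are further out of date.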

Key idea

Asynchronous SGD lets workers update whenever ready without a barrier, gaining throughput at the cost of stale gradients that can hurt convergence.

Check yourself

1. What is the main drawback of asynchronous SGD?

2. Why does asynchronous SGD get high throughput?