The architecture
The parameter server pattern splits roles. One set of machines, the servers, hold the authoritative copy of the model weights. Another set, the workers, compute gradients on data shards.
- A worker pulls the current weights from the servers.
- It computes gradients on its batch.
- It pushes those gradients back, and the servers apply the update.
Sharding the weights across several servers spreads the network load so no single server is overwhelmed.
Synchronous or asynchronous
- Synchronous: servers wait for all workers before updating. This matches single machine training but a slow worker, a straggler, holds everyone up.
- Asynchronous: each worker updates as soon as it finishes. There are no stragglers, but a worker may compute on slightly stale weights, adding noise to training.
Where it fits
Parameter servers were the dominant pattern before fast collective libraries. Today, all reduce often wins for dense models because it avoids a server tier, but the pattern still suits sparse workloads where each worker touches only a few parameters.
Key idea
A parameter server holds the master weights while workers pull weights and push gradients; the choice between synchronous and asynchronous updates trades straggler delay against stale gradient noise.