A collective that decentralizes
All-reduce is the workhorse of synchronous data-parallel training. It takes a tensor held on each device, combines the tensors element-wise with a reduction such as a sum, and gives every device the identical combined result.
- Input is one tensor per device.
- The reduction is typically a sum, followed by a division by the number of devices to get the average.
- Every device ends with the same output tensor.
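The three points above can be sketched in a few lines of plain Python. This is only a single-process simulation of the semantics, assuming each device's tensor is a flat list of floats; the function name `all_reduce` and its `average` flag are illustrative and do not come from any particular library.

```python
def all_reduce(per_device, average=False):
    """Simulate all-reduce over a list of equal-length vectors, one per device."""
    n = len(per_device)
    # Element-wise sum across devices.
    total = [sum(vals) for vals in zip(*per_device)]
    if average:
        # Divide by the device count to turn the sum into a mean.
        total = [t / n for t in total]
    # Every device receives its own copy of the same result.
    return [list(total) for _ in range(n)]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one gradient per device
out = all_reduce(grads, average=True)
print(out)  # [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
```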
Why it beats a central hub
Unlike a parameter server, all-reduce has no central node. The devices cooperate as peers, so no single node's bandwidth becomes the bottleneck, and the pattern scales better as devices are added. After the collective, all replicas hold the same averaged gradients and apply matching updates.
- It keeps replicas consistent, because each one applies the same update.
- It is communication-bound, so the efficiency of the algorithm behind it strongly affects training throughput.
- Ring and tree algorithms implement it efficiently.
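As an illustration of the ring variant, here is a minimal single-process simulation of one common scheme: a reduce-scatter pass followed by an all-gather pass, each taking n - 1 steps for n devices. The function name `ring_all_reduce` and the chunking rule are assumptions made for this sketch, not the implementation of any specific library.

```python
def ring_all_reduce(per_device):
    """Simulate a ring all-reduce (sum) over equal-length vectors, one per device."""
    n = len(per_device)
    length = len(per_device[0])
    bufs = [list(v) for v in per_device]  # each device's working buffer
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def exchange(chunk_for, combine):
        # Snapshot all outgoing chunks first so every send in a step is simultaneous.
        sends = []
        for i in range(n):
            c = chunk_for(i)
            sends.append((c, [bufs[i][k] for k in chunk(c)]))
        # Device i sends only to its right-hand neighbour (i + 1) mod n.
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n
            for k, v in zip(chunk(c), data):
                bufs[dst][k] = combine(bufs[dst][k], v)

    # Phase 1, reduce-scatter: partial sums travel around the ring.
    # Afterwards device i holds the fully summed chunk (i + 1) mod n.
    for step in range(n - 1):
        exchange(lambda i: (i - step) % n, lambda old, new: old + new)

    # Phase 2, all-gather: the finished chunks travel around the ring,
    # overwriting the stale partial sums on each device.
    for step in range(n - 1):
        exchange(lambda i: (i + 1 - step) % n, lambda old, new: new)

    return bufs


vectors = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
result = ring_all_reduce(vectors)
```

In this scheme each device only ever talks to its neighbour and sends one chunk per step, which is why no single link has to carry every device's full tensor.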
Combine and broadcast
The collective both reduces and distributes, so no extra broadcast step is needed.
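To make the equivalence concrete, the sketch below compares a two-step reduce-then-broadcast with a single all-reduce in a single-process simulation. The helper names `reduce_to_root`, `broadcast`, and `all_reduce` are hypothetical and chosen only for this example.

```python
def reduce_to_root(per_device):
    """Sum all devices' vectors; only the root device holds the result."""
    return [sum(vals) for vals in zip(*per_device)]


def broadcast(root_value, n):
    """Copy the root's vector to each of n devices."""
    return [list(root_value) for _ in range(n)]


def all_reduce(per_device):
    """Sum all devices' vectors and leave a copy on every device."""
    total = [sum(vals) for vals in zip(*per_device)]
    return [list(total) for _ in per_device]


data = [[1, 2], [3, 4], [5, 6]]
two_step = broadcast(reduce_to_root(data), len(data))
one_step = all_reduce(data)
print(two_step == one_step)  # True
```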
Key idea
All-reduce sums the gradients from every device and returns the same result to all of them with no central node, keeping data-parallel replicas synchronized.
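A toy synchronous data-parallel loop shows how this keeps replicas in step. It is a sketch under simple assumptions: a scalar weight per replica, a squared-error loss on the model y = w * x, and an `all_reduce_mean` helper that simulates the collective in one process. All names here are illustrative.

```python
def local_gradient(w, shard):
    """Gradient of 0.5 * (w * x - y) ** 2, averaged over one device's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Simulated all-reduce: every device receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train_step(weights, shards, lr=0.1):
    # Each replica computes a gradient on its own shard of the batch.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    # All-reduce hands every replica the same averaged gradient.
    avg = all_reduce_mean(grads)
    # Each replica applies the same update to its own copy of the weight.
    return [w - lr * g for w, g in zip(weights, avg)]


shards = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]  # samples of y = 2x, one shard per device
weights = [0.0, 0.0, 0.0]  # replicas start from identical weights
for _ in range(50):
    weights = train_step(weights, shards)
```

Because the replicas start from the same weight and apply the same averaged gradient at every step, their weights remain equal throughout training.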