The All Reduce Collective

A collective that decentralizes

All reduce is the workhorse of synchronous data parallel training. It takes a tensor held on each device, combines the tensors element by element with a reduction such as a sum, and gives every device the identical combined result.

Input is one tensor per device.
The reduction is typically a sum, followed by a division by the number of devices to get the average.
Every device ends with the same output tensor.
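
The three points above can be sketched as a single-process simulation. This is only an illustration in plain Python: the function name all_reduce, the average flag, and the list-of-lists representation of "one tensor per device" are assumptions made for the example, not the API of any real communication library.

```python
from typing import List

Vector = List[float]

def all_reduce(per_device: List[Vector], average: bool = False) -> List[Vector]:
    """Simulate all reduce over equal-length vectors, one per device."""
    n = len(per_device)
    # Element-wise sum across devices.
    total = [sum(values) for values in zip(*per_device)]
    if average:
        # Divide by the device count to turn the sum into a mean.
        total = [x / n for x in total]
    # Every device receives its own copy of the same result.
    return [list(total) for _ in range(n)]

# Hypothetical gradients held on three devices.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = all_reduce(grads)
averaged = all_reduce(grads, average=True)
```

Here summed holds [9.0, 12.0] on every device and averaged holds [3.0, 4.0] on every device.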

Why it beats a central hub

Unlike a parameter server, all reduce has no central node. The devices cooperate as peers, so there is no single bottleneck and the pattern scales better as devices are added. After the collective, every replica holds the same averaged gradients and applies the same update.
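
A toy training loop can make the last sentence concrete. The sketch below assumes a one-parameter linear model fitted to y = 2x, two replicas with different data shards, and a simulated averaging step; local_gradient, all_reduce_mean, the shard values, and the learning rate are all hypothetical choices for the illustration.

```python
def local_gradient(w, shard):
    # Gradient of 0.5 * (w * x - y)^2, averaged over this replica's shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # Simulated all reduce: every replica gets the same mean gradient.
    mean = sum(values) / len(values)
    return [mean for _ in values]

# Each replica sees a different shard of data drawn from y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0, 0.0]  # one copy of the parameter per replica
lr = 0.05

for _ in range(50):
    # Local gradients differ because the shards differ.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    # After averaging, every replica applies the same update.
    avg = all_reduce_mean(grads)
    weights = [w - lr * g for w, g in zip(weights, avg)]
```

Although the local gradients differ at every step, the two copies of the weight stay equal throughout and both approach 2.0, with no node acting as a hub.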

It keeps replicas consistent, because each one applies the same update.
Its cost is dominated by communication rather than computation, so an efficient implementation matters for training throughput.
It is commonly implemented with ring or tree algorithms.
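
As a rough picture of the ring variant, the sketch below simulates it in one process. It assumes the common two-phase formulation (a reduce-scatter pass followed by an all-gather pass) and, for simplicity, that each vector has exactly one element per device so a chunk is a single number. The name ring_all_reduce is hypothetical, and real libraries may organize the transfers differently.

```python
def ring_all_reduce(per_device):
    """Simulate a ring all reduce (sum) with one single-element chunk per device."""
    n = len(per_device)
    assert all(len(v) == n for v in per_device), "need one chunk per device"
    data = [list(v) for v in per_device]

    # Phase 1, reduce-scatter: at each step, device i sends one chunk to its
    # right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] += value
    # Device i now holds the complete sum for chunk (i + 1) % n.

    # Phase 2, all-gather: completed chunks travel around the ring and
    # overwrite the partial values held by each neighbour.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] = value
    return data

result = ring_all_reduce([[1.0, 2.0, 3.0],
                          [10.0, 20.0, 30.0],
                          [100.0, 200.0, 300.0]])
```

Every device ends with [111.0, 222.0, 333.0]. In this scheme each device sends 2(n - 1) chunks in total, each 1/n of the vector, so the traffic per device stays below twice the vector size however many devices join the ring.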

Combine and broadcast

The collective both reduces and distributes, so no extra broadcast step is needed.
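
One way to read this is that all reduce produces the same result as a reduce to one device followed by a broadcast from it. The sketch below checks that equivalence in a single-process simulation; reduce_to_root, broadcast, and all_reduce are hypothetical helpers written for the example, and the choice of device 0 as root is arbitrary.

```python
def reduce_to_root(per_device, root=0):
    # Only the root device ends up holding the element-wise sum.
    total = [sum(values) for values in zip(*per_device)]
    return [total if i == root else None for i in range(len(per_device))]

def broadcast(per_device, root=0):
    # Copy the root's value to every device.
    return [list(per_device[root]) for _ in per_device]

def all_reduce(per_device):
    # Reduce and distribute in a single collective.
    total = [sum(values) for values in zip(*per_device)]
    return [list(total) for _ in per_device]

inputs = [[1.0, 2.0], [3.0, 4.0]]
two_step = broadcast(reduce_to_root(inputs))
one_step = all_reduce(inputs)
```

Both two_step and one_step hold [4.0, 6.0] on every device; the difference is that the single collective does not route the result through one root device.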

Key idea

All reduce sums the gradients from every device and returns the same result to all of them with no central node, keeping data parallel replicas synchronized.
