Machine Learning

The All Reduce Collective

Sum gradients across every device and hand each one the same result.

A collective that decentralizes

All reduce is the workhorse of synchronous data parallel training. It takes a tensor held on each device, combines those tensors element by element with a reduction such as a sum, and gives every device the identical combined result.

  • Input is one tensor per device.
  • The reduction is typically a sum, followed by a division by the device count to get the average.
  • Every device ends with the same output tensor.
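The three points above can be sketched in plain Python. This is only a single-process simulation, not a real communication library: each inner list stands in for one device's gradient tensor, and the function name `all_reduce_mean` is a hypothetical helper chosen for illustration.

```python
def all_reduce_mean(per_device):
    """Simulate all reduce: sum element-wise across devices, divide by the
    device count, and return one identical copy per device."""
    num_devices = len(per_device)
    # zip(*...) groups the i-th element from every device together.
    summed = [sum(values) for values in zip(*per_device)]
    averaged = [s / num_devices for s in summed]
    # Every device receives its own copy of the same averaged tensor.
    return [list(averaged) for _ in range(num_devices)]


# Four simulated devices, each holding a three-element gradient.
grads = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0],
]
result = all_reduce_mean(grads)
print(result[0])  # [4.0, 5.0, 6.0], and the same on every other device
```

In a real training job the lists would be device tensors and the sum would happen over the interconnect, but the input and output shapes follow the same pattern: one tensor per device in, one identical tensor per device out.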

Why it beats a central hub

Unlike a parameter server, all reduce has no central node. The devices cooperate as peers, so no single node's bandwidth becomes the bottleneck and the pattern scales better as devices are added. After the collective, all replicas hold the same averaged gradients and apply matching updates.

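  • It keeps replicas consistent: every replica receives the same reduced gradients, so replicas that start from the same weights apply the same update.
  • Its cost is dominated by communication rather than arithmetic, so an efficient implementation matters a lot for training throughput.
  • Ring and tree algorithms are common efficient ways to implement it.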
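To make the ring variant concrete, here is a minimal single-process sketch. It assumes each device holds exactly as many elements as there are devices, so every chunk is one number; real implementations split larger tensors into chunks of many elements. The function name `ring_all_reduce` is hypothetical, and the sketch simulates the message passing with lists rather than a network.

```python
def ring_all_reduce(per_device):
    """Simulate a ring all reduce (sum) over n devices.

    Assumes each device holds a list of exactly n numbers, so every
    chunk is a single element. Returns the final buffer of each device.
    """
    n = len(per_device)
    assert all(len(buf) == n for buf in per_device), "one element per chunk"
    bufs = [list(buf) for buf in per_device]  # copy so inputs stay untouched

    # Phase 1, reduce-scatter: each device passes one chunk to its right
    # neighbor, which adds it in. After n - 1 steps, device r holds the
    # complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages first so all sends happen "at once".
        sends = [(r, (r - step) % n) for r in range(n)]
        messages = [(r, c, bufs[r][c]) for r, c in sends]
        for r, c, value in messages:
            bufs[(r + 1) % n][c] += value

    # Phase 2, all-gather: the finished chunks travel around the ring,
    # overwriting the partial sums on each device they reach.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        messages = [(r, c, bufs[r][c]) for r, c in sends]
        for r, c, value in messages:
            bufs[(r + 1) % n][c] = value

    return bufs


grads = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [100.0, 200.0, 300.0],
]
ring_result = ring_all_reduce(grads)
print(ring_result[0])  # [111.0, 222.0, 333.0] on every device
```

Each device only ever talks to its right neighbor, and no device plays the role of a hub. With n devices and a tensor of size S, each device sends roughly 2(n - 1)/n × S in total across both phases, which stays close to 2S however many devices join the ring.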

Combine and broadcast

The collective both reduces and distributes, so no extra broadcast step is needed.
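As a rough illustration of that point, the sketch below compares a two-step approach (reduce to one root device, then broadcast) with a single all reduce. All three function names are hypothetical helpers, and the devices are simulated with plain lists.

```python
def reduce_to_root(per_device):
    """Sum element-wise; only the root device ends up with the result."""
    return [sum(values) for values in zip(*per_device)]


def broadcast(value, num_devices):
    """Copy the root's value to every device."""
    return [list(value) for _ in range(num_devices)]


def all_reduce_sum(per_device):
    """Single collective: every device receives the summed tensor."""
    total = [sum(values) for values in zip(*per_device)]
    return [list(total) for _ in per_device]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
two_step = broadcast(reduce_to_root(grads), len(grads))
one_step = all_reduce_sum(grads)
print(two_step == one_step)  # True
```

The results match, but the two-step version routes everything through the root, which is exactly the kind of central node all reduce avoids.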

Key idea

All reduce sums the gradients from every device and returns the same result to all of them with no central node, keeping data parallel replicas synchronized.

Check yourself

1. What does all reduce return to each device?

2. How does all reduce differ from a parameter server?