← Lessons

quiz vs the machine

Gold1470

Machine Learning

Distributed All Reduce

The collective that sums gradients across GPUs efficiently with ring algorithms.

6 min read · core · beat Gold to climb

What it does

All reduce is the collective operation that combines a value from every GPU and gives the combined result back to all of them. In training, the value is the gradient and the combine is a sum or average, so after all reduce every device holds the same averaged gradient.

The naive approach is slow

A simple plan sends every GPU's gradient to one central node, sums them, and broadcasts back. That central node becomes a bottleneck, and traffic grows with the number of devices.

Ring all reduce

The ring algorithm arranges GPUs in a circle and runs two phases.

  • Reduce scatter: each GPU sends a chunk to its neighbor and adds the chunk it receives, repeated until each GPU owns the full sum of one chunk.
  • All gather: the completed chunks are passed around the ring until every GPU has every chunk.

The clever part is that the data each GPU sends is independent of the total device count. Bandwidth use is roughly constant per device, so ring all reduce scales well to many GPUs.

Key idea

All reduce sums gradients across all GPUs and returns the result to each; the ring algorithm avoids a central bottleneck by passing chunks around a circle with near constant bandwidth per device.

Check yourself

Answer to earn rating on the learn ladder.

1. After an all reduce of gradients, what does each GPU hold?

2. Why is ring all reduce preferred over sending everything to one node?

3. What are the two phases of ring all reduce?