Distributed All Reduce

The collective that sums gradients across GPUs efficiently with ring algorithms.

What it does

All reduce is the collective operation that combines a value from every GPU and gives the combined result back to all of them. In training, the value is the gradient and the combine is a sum or average, so after all reduce every device holds the same averaged gradient.

The naive approach is slow

A simple plan sends every GPU's gradient to one central node, sums them, and broadcasts back. That central node becomes a bottleneck, and traffic grows with the number of devices.

Ring all reduce

The ring algorithm arranges GPUs in a circle and runs two phases.

Reduce scatter: each GPU sends a chunk to its neighbor and adds the chunk it receives, repeated until each GPU owns the full sum of one chunk.
All gather: the completed chunks are passed around the ring until every GPU has every chunk.

The clever part is that the data each GPU sends is independent of the total device count. Bandwidth use is roughly constant per device, so ring all reduce scales well to many GPUs.

Key idea

All reduce sums gradients across all GPUs and returns the result to each; the ring algorithm avoids a central bottleneck by passing chunks around a circle with near constant bandwidth per device.

Distributed All Reduce

What it does

The naive approach is slow

Ring all reduce

Key idea

Check yourself