A bandwidth optimal layout
Ring all reduce implements the all reduce collective by placing devices in a logical ring. Each device sends to one neighbor and receives from the other, and the gradient tensor is split into one chunk per device so that the chunks can flow around the ring.
- Devices form a ring and pass chunks to the next neighbor.
- A reduce scatter phase sums chunks around the ring, leaving each device with the complete sum of one chunk.
- An all gather phase circulates the finished sums until every device holds all of them.
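The two phases above can be sketched as a small single-process simulation. This is an illustrative model under stated assumptions, not how any particular library implements it: devices are plain Python lists, a "send" is a list copy, and the function name `ring_all_reduce` and the chunk schedule (device r sends chunk (r - s) mod n at step s) are choices made for this example.

```python
def ring_all_reduce(buffers):
    """Simulate a sum ring all reduce over equal-length lists.

    buffers[r] is the local data on device r (hypothetical layout).
    Returns one list per device, each holding the elementwise sum.
    """
    n = len(buffers)
    length = len(buffers[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(i * length) // n for i in range(n + 1)]
    data = [list(b) for b in buffers]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Reduce scatter: at step s, device r sends chunk (r - s) mod n to
    # device r + 1, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        outgoing = [[data[r][i] for i in chunk((r - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, v in zip(chunk((r - s) % n), outgoing[r]):
                data[dst][i] += v
    # Device r now holds the complete sum of chunk (r + 1) mod n.

    # All gather: at step s, device r sends chunk (r + 1 - s) mod n to
    # device r + 1, which overwrites its copy with the finished sum.
    for s in range(n - 1):
        outgoing = [[data[r][i] for i in chunk((r + 1 - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, v in zip(chunk((r + 1 - s) % n), outgoing[r]):
                data[dst][i] = v
    return data


grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
result = ring_all_reduce(grads)
print(result[0])
```

Every device ends with the same summed list, and each step moves only one chunk per device, which is the property the rest of this section relies on.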
Why the bandwidth is flat
With N devices and a tensor of S bytes, each device sends and receives 2S(N-1)/N bytes in total: N-1 chunks of S/N bytes in each of the two phases. That volume approaches 2S as N grows but never exceeds it, so the data each device moves stays nearly constant as you add more devices. That is what makes it scale.
- Latency grows with ring length, since a ring of N devices needs 2(N-1) sequential steps; per device traffic does not.
- It is the basis of many production training stacks.
- It needs reliable, evenly matched links, since one slow device or link stalls the whole ring.
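The trade-off between step count and per device traffic can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes the tensor divides into equal chunks, one per device, and ignores per-message overhead; `ring_traffic` is a hypothetical helper name.

```python
def ring_traffic(num_devices, tensor_bytes):
    """Return (steps, bytes_sent_per_device) for one ring all reduce.

    Assumes equal chunks of tensor_bytes / num_devices and one chunk
    sent per device per step.
    """
    # n - 1 steps of reduce scatter plus n - 1 steps of all gather.
    steps = 2 * (num_devices - 1)
    bytes_sent = steps * tensor_bytes / num_devices
    return steps, bytes_sent


for n in (2, 4, 8, 64, 1024):
    steps, sent = ring_traffic(n, 1_000_000_000)
    print(f"{n:5d} devices: {steps:5d} steps, {sent / 1e9:.3f} GB sent per device")
```

For a 1 GB tensor the step count grows linearly with the number of devices, while the volume each device sends stays below 2 GB no matter how large the ring gets.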
Flowing around the ring
Chunks travel neighbor to neighbor in a single direction, being summed on the first trip around the ring and shared on the second.
Key idea
Ring all reduce splits gradients into chunks passed neighbor to neighbor, keeping per device bandwidth flat as the number of devices grows.