Machine Learning

The Ring All-Reduce

Arrange devices in a ring so that the data each device moves during an all-reduce stays nearly flat as the cluster scales.

5 min read · advanced

A bandwidth-optimal layout

Ring all-reduce implements the all-reduce collective by placing N devices in a logical ring. Each device talks only to its two neighbors, receiving from the previous device and sending to the next, and the gradient tensor is split into N chunks that flow around the ring.

  • Devices form a ring and pass chunks to the next neighbor.
  • A reduce-scatter phase sums chunks around the ring, leaving each device with the complete sum of one chunk.
  • An all-gather phase circulates the finished sums until every device holds all of them.
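
The two phases can be sketched as a small single-process simulation. This is a minimal illustration, not a production implementation: the function name ring_all_reduce and the list-of-lists layout are assumptions made for the example, the reduction is a plain sum, and the buffer length is assumed to be divisible by the number of devices.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce on a ring; buffers[r] is device r's list."""
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("sketch assumes equal lengths divisible by device count")
    size = length // n
    # chunks[r][c] is device r's current copy of chunk c (inputs are copied).
    chunks = [[list(b[c * size:(c + 1) * size]) for c in range(n)]
              for b in buffers]

    # Phase 1, reduce-scatter: n - 1 steps. At step s, device r sends chunk
    # (r - s) % n to its next neighbor, which adds it to its own copy.
    for s in range(n - 1):
        # Snapshot all payloads first so every device "sends" simultaneously.
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [list(chunks[r][c]) for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            nxt = (r + 1) % n
            chunks[nxt][c] = [a + b for a, b in zip(chunks[nxt][c], payload)]
    # Device r now holds the complete sum for chunk (r + 1) % n.

    # Phase 2, all-gather: n - 1 steps. At step s, device r forwards chunk
    # (r + 1 - s) % n, and the receiver overwrites its stale copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [list(chunks[r][c]) for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            chunks[(r + 1) % n][c] = payload

    # Reassemble each device's chunks into one flat buffer.
    return [[x for chunk in device for x in chunk] for device in chunks]


print(ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
# prints [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Every device ends with the same summed buffer, and each transfer in the simulation goes only from a device to its next neighbor.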

Why the bandwidth is flat

Each device sends and receives 2(N - 1)/N times the tensor size in total, which never exceeds twice the tensor size however many devices there are. Because the data each device moves does not grow with the ring size, the bandwidth each device needs stays roughly constant as you add more devices. That is what makes it scale.

  • Latency grows with ring length (2(N - 1) steps for N devices); bandwidth does not.
  • It is the basis of many production training stacks.
  • It needs reliable links, since one slow device or link stalls the whole ring.
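
The trade-off between flat bandwidth and growing latency can be checked with a small cost model. This is a rough sketch under simplifying assumptions: the helper name ring_all_reduce_cost is hypothetical, the tensor is assumed to split evenly into one chunk per device, and per-message overhead and link speed are ignored.

```python
def ring_all_reduce_cost(num_devices, tensor_bytes):
    """Return (steps, bytes_sent_per_device) for an idealized ring all-reduce."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    chunk_bytes = tensor_bytes / num_devices   # one chunk per device
    steps = 2 * (num_devices - 1)              # reduce-scatter + all-gather
    # Each device sends exactly one chunk per step.
    return steps, steps * chunk_bytes


tensor_bytes = 2 ** 30  # a 1 GiB gradient tensor, chosen only for illustration
for n in (2, 8, 64, 1024):
    steps, sent = ring_all_reduce_cost(n, tensor_bytes)
    print(f"N={n:5d}  steps={steps:5d}  sent/tensor={sent / tensor_bytes:.4f}")
```

In this model the step count grows linearly with the number of devices, while the bytes each device sends stay below twice the tensor size.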

Flowing around the ring

Chunks travel neighbor to neighbor, always in the same direction: they are summed on the first pass around the ring (reduce-scatter) and shared on the second (all-gather).

Key idea

Ring all-reduce splits gradients into chunks passed neighbor to neighbor, keeping the data each device moves nearly flat as the number of devices grows.

Check yourself

1. Which devices does each device communicate with in ring all-reduce?

2. Why does ring all-reduce scale well in bandwidth?

3. What are the two phases of ring all-reduce?