Machine Learning

Gradient Compression

Shrink the gradients sent over the network to relieve the communication bottleneck.

When the network is the bottleneck

At large scale, exchanging gradients can cost more than computing them. Gradient compression reduces the number of bits sent each step so communication stops being the limiting factor.

  • Quantization encodes each gradient value in fewer bits.
  • Sparsification sends only the largest-magnitude gradient entries.
  • A residual carries the dropped part into the next step.
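The two compression styles above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production scheme: the function names `quantize_uniform`, `dequantize_uniform`, and `top_k_sparsify` are hypothetical, the quantizer assumes a simple symmetric uniform grid, and gradients are assumed to be flat lists of floats.

```python
def quantize_uniform(grad, bits=8):
    """Map each value to a small signed integer code on a uniform grid (assumed scheme)."""
    scale = max(abs(g) for g in grad) or 1.0   # avoid dividing by zero for an all-zero gradient
    levels = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits
    codes = [round(g / scale * levels) for g in grad]
    return codes, scale                        # send the codes plus one scale factor


def dequantize_uniform(codes, scale, bits=8):
    """Recover approximate gradient values from the integer codes."""
    levels = 2 ** (bits - 1) - 1
    return [c * scale / levels for c in codes]


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries; everything else becomes the residual."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    sent = {i: grad[i] for i in keep}          # index -> value pairs to transmit
    residual = [0.0 if i in keep else g for i, g in enumerate(grad)]
    return sent, residual


grad = [0.5, -2.0, 0.1, 1.5]
codes, scale = quantize_uniform(grad, bits=8)
restored = dequantize_uniform(codes, scale, bits=8)
sent, residual = top_k_sparsify(grad, k=2)
```

Here `sent` holds the two largest-magnitude entries and `residual` holds the part that was dropped, which is what gets carried into the next step.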

Keeping training correct

Compression introduces error, so good methods track what they did not send. Error feedback stores the residual (the gradient minus what was actually sent) locally and adds it to the next step's gradient before compressing. Dropped gradient mass is therefore delayed rather than lost, and convergence stays close to the uncompressed run.

  • The largest-magnitude entries often carry most of the gradient, so sparsification loses little.
  • Error feedback prevents systematic bias.
  • Aggressive compression can still slow convergence.
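The error feedback loop can be sketched as follows. This is a toy illustration under stated assumptions: a single worker, top-k sparsification as the compressor, and a hypothetical helper named `compress_with_error_feedback`. The gradient values are made up so the bookkeeping is easy to follow.

```python
def compress_with_error_feedback(grad, residual, k):
    """One step of top-k compression with error feedback (illustrative helper)."""
    # Add back whatever was dropped in earlier steps before compressing.
    corrected = [g + r for g, r in zip(grad, residual)]
    order = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)
    keep = set(order[:k])
    sent = {i: corrected[i] for i in keep}
    # Whatever is not sent this step stays local as the new residual.
    new_residual = [0.0 if i in keep else c for i, c in enumerate(corrected)]
    return sent, new_residual


# Toy run: the same made-up gradient for six steps, sending one entry per step.
grads = [[1.0, 0.25, 0.25]] * 6
residual = [0.0, 0.0, 0.0]
total_sent = [0.0, 0.0, 0.0]
for g in grads:
    sent, residual = compress_with_error_feedback(g, residual, k=1)
    for i, v in sent.items():
        total_sent[i] += v

# Invariant: what was sent plus what is still held locally equals the full gradient sum.
true_sum = [sum(g[i] for g in grads) for i in range(3)]
accounted = [total_sent[i] + residual[i] for i in range(3)]
```

In this run the small entries are not sent at first, but their residual keeps growing until it outranks the large entry and is transmitted. Without the residual, those entries would be dropped at every step.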

Compress and correct

The residual loop is what lets heavy compression remain accurate over many steps.

Key idea

Gradient compression sends fewer bits per step through quantization or sparsification, using error feedback to retain dropped mass and keep convergence on track.

Check yourself

1. What does error feedback store?

2. Why does gradient compression help at large scale?