When the network is the bottleneck
At large scale, exchanging gradients can cost more than computing them. Gradient compression reduces the number of bits sent each step so communication stops being the limiting factor.
- Quantization sends each gradient entry in fewer bits (for example, 8 instead of 32).
- Sparsification sends only the largest-magnitude gradient entries.
- A residual carries the dropped part into the next step.
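The two compressors above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a production implementation: the function names (`top_k_sparsify`, `quantize_uniform`, `dequantize_uniform`) and the choice of a uniform quantization grid are assumptions made for the example.

```python
from typing import List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> Tuple[List[Tuple[int, float]], List[float]]:
    """Keep the k largest-magnitude entries (hypothetical helper).

    Returns (sent, residual): `sent` is a list of (index, value) pairs to
    transmit, `residual` holds the dropped entries at their original positions.
    """
    # Indices ordered by descending magnitude; sorted() is stable, so ties
    # go to the lower index.
    order = sorted(range(len(grad)), key=lambda i: -abs(grad[i]))
    keep = set(order[:k])
    sent = [(i, grad[i]) for i in sorted(keep)]
    residual = [0.0 if i in keep else g for i, g in enumerate(grad)]
    return sent, residual


def quantize_uniform(grad: List[float], bits: int) -> Tuple[List[int], float]:
    """Map each entry to one of 2**bits evenly spaced levels in [-scale, scale].

    Assumes a non-empty gradient. Returns integer codes plus the scale the
    receiver needs to decode them.
    """
    scale = max(abs(g) for g in grad) or 1.0  # avoid dividing by zero
    levels = 2 ** bits - 1
    codes = [round((g / scale + 1.0) / 2.0 * levels) for g in grad]
    return codes, scale


def dequantize_uniform(codes: List[int], scale: float, bits: int) -> List[float]:
    """Invert quantize_uniform up to rounding error."""
    levels = 2 ** bits - 1
    return [(c / levels * 2.0 - 1.0) * scale for c in codes]
```

With `grad = [0.1, -2.0, 0.05, 1.5]` and `k = 2`, only indices 1 and 3 are sent and the two small entries land in the residual. For the quantizer, the round-trip error per entry is at most half a grid step, `scale / (2**bits - 1)`.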
Keeping training correct
Compression introduces error, so good methods track what they did not send. Error feedback stores the residual locally and adds it to the next step's gradient before compressing, so dropped gradient mass is delayed rather than lost and convergence typically stays close to the uncompressed run.
- The largest-magnitude entries typically carry most of the gradient norm, so sparsification loses little.
- Error feedback keeps the compressor's bias from accumulating across steps.
- Aggressive compression can still slow convergence.
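One way to picture error feedback is as a single step that corrects, compresses, and updates the residual. The sketch below assumes top-k sparsification as the compressor and uses a hypothetical function name, `error_feedback_step`; real systems would operate on tensors and combine this with a collective communication call.

```python
from typing import List, Tuple


def error_feedback_step(
    grad: List[float], residual: List[float], k: int
) -> Tuple[List[float], List[float]]:
    """One round of error feedback with top-k sparsification (illustrative).

    Returns (sent, new_residual), both dense lists of the same length as grad.
    """
    # 1. Add back what was dropped in earlier steps.
    corrected = [g + r for g, r in zip(grad, residual)]
    # 2. Compress: keep only the k largest-magnitude corrected entries.
    order = sorted(range(len(corrected)), key=lambda i: -abs(corrected[i]))
    keep = set(order[:k])
    sent = [c if i in keep else 0.0 for i, c in enumerate(corrected)]
    # 3. Whatever was not sent becomes the residual for the next step.
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual


# Usage: a fixed gradient, sending one entry per step.
grad = [1.0, 0.4, 0.3]
residual = [0.0, 0.0, 0.0]
total_sent = [0.0, 0.0, 0.0]
for _ in range(3):
    sent, residual = error_feedback_step(grad, residual, k=1)
    total_sent = [t + s for t, s in zip(total_sent, sent)]
```

In this run the first two steps send index 0. By the third step the accumulated residual at index 1 has grown past 1.0, so it is sent instead: small entries are delayed, not discarded. At every point, `total_sent[i] + residual[i]` equals the sum of the true gradients at coordinate `i`, up to floating-point rounding.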
Compress and correct
The residual loop is what lets heavy compression remain accurate over many steps.
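A short derivation shows why the residual loop preserves accuracy. The notation here is an assumption introduced for illustration: g_t is the true gradient at step t, e_t the locally stored residual, C the compressor, and \hat{g}_t what is actually transmitted.

```latex
\begin{aligned}
p_t &= g_t + e_t, \\
\hat{g}_t &= C(p_t), \\
e_{t+1} &= p_t - \hat{g}_t, \qquad e_0 = 0, \\
\sum_{t=0}^{T-1} \hat{g}_t &= \sum_{t=0}^{T-1} g_t - e_T .
\end{aligned}
```

The last line follows by rewriting the update as \hat{g}_t = g_t + e_t - e_{t+1} and summing, so the intermediate residuals cancel. After T steps, the total transmitted update differs from the total true gradient only by the current residual e_T, which does not grow with T as long as the compressor keeps a fixed fraction of each corrected gradient's norm and the gradients stay bounded.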
Key idea
Gradient compression sends fewer bits per step through quantization or sparsification, using error feedback to retain dropped mass and keep convergence on track.