Gradient Compression

When the network is the bottleneck

At large scale, exchanging gradients can cost more than computing them. Gradient compression reduces the number of bits sent each step so communication stops being the limiting factor.

Quantization sends gradients in fewer bits.
Sparsification sends only the largest-magnitude gradient entries.
A residual carries the dropped part into the next step.
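
The two compressors above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (top_k_sparsify, densify, quantize, dequantize), the choice of top-k selection, and the uniform symmetric 8-bit scheme are all assumptions made for the example.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    indices = sorted(order[:k])
    return indices, [grad[i] for i in indices]


def densify(indices, values, n):
    """Rebuild a length-n dense vector from the sparse (indices, values) pair."""
    out = [0.0] * n
    for i, v in zip(indices, values):
        out[i] = v
    return out


def quantize(grad, bits=8):
    """Uniform symmetric quantization to signed integer codes (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    peak = max((abs(g) for g in grad), default=0.0)
    scale = peak / levels if peak > 0 else 1.0
    return [round(g / scale) for g in grad], scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


grad = [0.1, -2.0, 0.05, 1.5, -0.2]

# Sparsification: send 2 of 5 entries, keep the rest as the residual.
indices, values = top_k_sparsify(grad, k=2)
sent = densify(indices, values, len(grad))
residual = [g - s for g, s in zip(grad, sent)]

# Quantization: send small integer codes plus one float scale.
codes, scale = quantize(grad, bits=8)
approx = dequantize(codes, scale)
```

Here sparsification transmits only the index/value pairs, while quantization transmits every entry at lower precision. The residual list holds exactly what sparsification left out.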

Keeping training correct

Compression introduces error, so good methods track what they did not send. Error feedback stores the residual locally and adds it to the next step's gradient before compressing again, so dropped gradient mass is delayed rather than lost and convergence typically stays close to the uncompressed run.

The largest-magnitude entries usually carry most of the gradient's norm, so sparsification loses little.
Error feedback prevents systematic bias.
Aggressive compression can still slow convergence.
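
The error-feedback loop described above can be sketched as follows. This is an illustrative example under stated assumptions: error_feedback_step is a hypothetical helper, top-k sparsification stands in for the compressor, and the gradients are made-up numbers for a single worker.

```python
def error_feedback_step(grad, residual, k):
    """One step of error feedback with top-k sparsification (assumed compressor).

    Returns the vector actually sent and the residual to carry forward.
    """
    # Add back what earlier steps dropped.
    corrected = [g + r for g, r in zip(grad, residual)]
    # Pick the k largest-magnitude entries of the corrected gradient.
    order = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)
    keep = set(order[:k])
    sent = [c if i in keep else 0.0 for i, c in enumerate(corrected)]
    # Whatever was not sent becomes the new residual.
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual


# Made-up gradients for three steps on one worker.
grads = [[0.5, -0.1, 0.25], [0.4, -0.1, 0.25], [0.3, -0.1, 0.25]]
residual = [0.0, 0.0, 0.0]
total_sent = [0.0, 0.0, 0.0]

for g in grads:
    sent, residual = error_feedback_step(g, residual, k=1)
    total_sent = [t + s for t, s in zip(total_sent, sent)]

# Everything computed so far is either already sent or still in the residual.
total_grad = [sum(col) for col in zip(*grads)]
```

After any number of steps, total_sent plus the current residual equals the sum of the raw gradients up to floating-point rounding. That identity is the sense in which no gradient mass is lost: small entries accumulate in the residual until they are large enough to be selected.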

Compress and correct

The residual loop is what lets heavy compression remain accurate over many steps.

Key idea

Gradient compression sends fewer bits per step through quantization or sparsification, using error feedback to retain dropped mass and keep convergence on track.
