Pruning and Sparsity

Most weights are redundant

Large networks are overparameterized, so many of their weights contribute little to the output. Pruning removes the least important weights, leaving a sparse model with fewer nonzero parameters.
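
A minimal sketch of the idea, assuming that "least important" means "smallest absolute value". Magnitude is a common heuristic but only one possible importance measure, and the function name and sample weights here are illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (sparsity in [0, 1])."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Illustrative values: four large weights and four near-zero ones.
weights = [0.8, -0.05, 0.3, 0.01, -1.2, 0.02, 0.6, -0.07]
pruned = magnitude_prune(weights, sparsity=0.5)
nonzero = sum(1 for w in pruned if w != 0.0)
print(pruned, nonzero)
```

At 50% sparsity the four near-zero weights are removed and the four large ones survive unchanged.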

Unstructured versus structured

Unstructured pruning zeros out individual weights anywhere in the network. It reaches high sparsity, but the scattered zeros are hard for hardware to exploit.
Structured pruning removes whole channels, filters, or heads. It gives real speedups because the remaining computation is still dense and regular.

The trade-off is flexibility versus hardware friendliness.
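
The difference can be sketched on a small weight matrix. The example assumes each row is one output channel, ranks individual weights by magnitude, and ranks rows by L1 norm; these scoring choices and the helper names are illustrative, not prescribed by the text.

```python
# Toy 3x4 weight matrix; each row is treated as one output channel.
W = [
    [0.9, -0.01, 0.4, 0.02],
    [0.03, 0.2, -0.04, 0.3],
    [-0.7, 0.6, 0.05, -0.5],
]


def prune_unstructured(matrix, n_prune):
    """Zero the n_prune smallest-magnitude entries, wherever they are."""
    flat = [(abs(v), r, c) for r, row in enumerate(matrix) for c, v in enumerate(row)]
    drop = {(r, c) for _, r, c in sorted(flat)[:n_prune]}
    return [[0.0 if (r, c) in drop else v for c, v in enumerate(row)]
            for r, row in enumerate(matrix)]


def prune_structured(matrix, n_rows):
    """Remove the n_rows rows with the smallest L1 norm; the result stays dense."""
    norms = sorted((sum(abs(v) for v in row), r) for r, row in enumerate(matrix))
    drop = {r for _, r in norms[:n_rows]}
    return [row for r, row in enumerate(matrix) if r not in drop]


unstructured = prune_unstructured(W, n_prune=4)  # same shape, scattered zeros
structured = prune_structured(W, n_rows=1)       # one fewer row, no zeros added
print(unstructured)
print(structured)
```

The unstructured result is still a 3x4 matrix, so a dense kernel does the same amount of work on it. The structured result is a genuinely smaller 2x4 matrix.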

A typical workflow

Pruning usually alternates with fine-tuning: prune some of the weights, fine-tune to recover lost accuracy, and repeat until the target sparsity is reached.
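
The loop can be sketched on a toy linear model. Everything here is an assumption made for illustration: the synthetic data, the magnitude criterion, the learning rate, and the schedule of three rounds that each prune one weight.

```python
import random


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, mask, data, lr=0.1, steps=300):
    """Gradient descent on MSE; pruned (masked) weights are held at zero."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grad, mask)]
    return w


def prune_step(w, mask, n_prune):
    """Mask out the n_prune smallest-magnitude weights that are still active."""
    active = sorted((abs(wi), i) for i, (wi, m) in enumerate(zip(w, mask)) if m)
    mask = list(mask)
    for _, i in active[:n_prune]:
        mask[i] = 0
    return [wi * m for wi, m in zip(w, mask)], mask


random.seed(0)
true_w = [2.0, 0.1, -1.5, 0.05, -0.08, 1.0]  # three large, three minor weights
data = []
for _ in range(64):
    x = [random.uniform(-1, 1) for _ in true_w]
    data.append((x, predict(true_w, x)))

w = [random.uniform(-0.5, 0.5) for _ in true_w]
mask = [1] * len(w)
w = fine_tune(w, mask, data)  # initial dense training

history = []  # (loss right after pruning, loss after fine-tuning)
for _ in range(3):
    w, mask = prune_step(w, mask, n_prune=1)
    loss_pruned = mse(w, data)
    w = fine_tune(w, mask, data)
    history.append((loss_pruned, mse(w, data)))

final_loss = mse(w, data)
print(mask, final_loss)
```

Each round removes one weight and then lets the surviving weights adjust, so the loss after fine-tuning is never higher than the loss measured right after pruning.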

Realizing the speedup

Sparsity only saves time if the hardware can skip the zeros. Some GPUs support structured sparsity patterns, such as two nonzeros in every group of four, that the tensor cores accelerate directly. Without such support, unstructured sparsity mainly saves storage rather than compute.
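
A sketch of building the two-in-four pattern, assuming the two largest-magnitude values in each consecutive group of four are the ones kept. The selection rule and the function name are illustrative. The code only produces the pattern; any actual speedup depends on hardware and library support that is not shown.

```python
def prune_2_of_4(weights):
    """Keep the two largest-magnitude values in each consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest absolute values within this group.
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


row = [0.9, -0.1, 0.4, 0.05, 0.02, -0.7, 0.6, 0.08]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Every group of four ends up with exactly two nonzeros, so the sparsity is fixed at 50% and the position of the zeros is regular enough to encode compactly.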

Key idea

Pruning removes unimportant weights to create sparse models, and structured patterns matched to hardware turn that sparsity into real speed, not just smaller size.
