Most weights are redundant
Large networks are overparameterized, so many weights contribute little to the output. Pruning removes the least important weights, leaving a sparse model with fewer nonzero parameters.
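As a minimal sketch of the idea, the snippet below assumes that "least important" means "smallest absolute value" (magnitude pruning), which is one common choice rather than the only one. The function name `magnitude_prune` and the random weight matrix are made up for illustration.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |value|."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    # Indices of the k smallest magnitudes (order among them does not matter).
    drop = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(flat.size, dtype=bool)
    mask[drop] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))          # hypothetical layer weights
pruned, mask = magnitude_prune(w, sparsity=0.75)
print(f"nonzero weights: {np.count_nonzero(pruned)} of {w.size}")
```

The returned mask is kept alongside the weights so that later training steps can hold the pruned entries at zero.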
Unstructured versus structured
- Unstructured pruning zeros out individual weights anywhere in a tensor. It reaches high sparsity, but the scattered zeros are hard for hardware to exploit.
- Structured pruning removes whole channels, filters, or heads. It gives real speedups because the remaining computation is still dense and regular.
The trade-off is flexibility versus hardware friendliness.
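To make the structured case concrete, here is a small sketch that removes whole output channels (rows) of a weight matrix. Ranking channels by L2 norm is an assumption chosen for illustration, and `prune_output_channels` is a hypothetical helper, not a library function.

```python
import numpy as np

def prune_output_channels(weight, keep):
    """Keep the `keep` rows (output channels) with the largest L2 norm."""
    norms = np.linalg.norm(weight, axis=1)
    kept = np.sort(np.argsort(norms)[-keep:])  # preserve original channel order
    return weight[kept], kept

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))   # 8 output channels, 16 inputs
x = rng.normal(size=16)

small_w, kept = prune_output_channels(w, keep=4)
y = small_w @ x                # still a dense matmul, just smaller
print(small_w.shape, kept)
```

The pruned layer is simply a smaller dense matrix, so an ordinary matrix multiply benefits without any special sparse kernel. In a real network the next layer's input dimension would have to shrink to match the kept channels.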
A typical workflow
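Pruning usually alternates with fine-tuning to recover lost accuracy: prune a fraction of the weights, fine-tune the survivors, and repeat until the target sparsity is reached.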
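The loop can be sketched on a toy problem. Everything below is an illustrative assumption: a linear regression stands in for the network, magnitude is used as the importance score, and the sparsity schedule (0.25, 0.5, 0.75) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]   # only 5 inputs matter
y = X @ true_w + 0.01 * rng.normal(size=256)

def train(w, mask, steps=200, lr=0.05):
    """Gradient descent on mean squared error; pruned weights stay at zero."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

mask = np.ones(20)
w = train(rng.normal(size=20) * 0.1, mask)   # dense training first

for target in (0.25, 0.5, 0.75):             # gradually raise sparsity
    k = int(target * w.size)                 # total weights pruned so far
    drop = np.argsort(np.abs(w))[:k]         # smallest magnitudes
    mask = np.ones(20)
    mask[drop] = 0.0
    w = train(w * mask, mask)                # fine-tune the survivors
    print(f"sparsity {target:.2f}: loss {loss(w):.5f}")
```

Multiplying by the mask after every update is what keeps pruned weights at zero during fine-tuning; without it, gradient steps would quietly regrow them.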
Realizing the speedup
Sparsity only saves time if the hardware can skip the zeros. Some GPUs support structured sparsity patterns, such as at most two nonzeros in every group of four consecutive weights (often written 2:4), that the tensor cores accelerate directly. Without such support, unstructured sparsity mainly saves storage rather than compute.
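A sketch of how a two-in-four mask might be built is shown here. It assumes groups of four consecutive entries along the last axis and keeps the two largest magnitudes in each group; `two_of_four_mask` is a hypothetical name. The NumPy code only produces the pattern, and any actual speedup would come from hardware and library kernels that this example does not use.

```python
import numpy as np

def two_of_four_mask(weight):
    """Keep the 2 largest-magnitude entries in each group of 4 along the last axis."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "last dimension must be a multiple of 4"
    groups = np.abs(weight).reshape(rows, cols // 4, 4)
    # Positions of the two smallest magnitudes within each group.
    drop = np.argsort(groups, axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))    # hypothetical layer weights
mask = two_of_four_mask(w)
sparse_w = w * mask
print(mask.astype(int))
```

Every group of four ends up with exactly two kept entries, so the overall sparsity is fixed at 50% and the layout is regular enough for hardware to index compactly.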
Key idea
Pruning removes unimportant weights to create sparse models, and structured patterns matched to hardware turn that sparsity into real speed, not just smaller size.