What pruning does
Pruning removes parts of a network judged to contribute little, shrinking the model. The hope is that many weights are near zero or redundant, so cutting them loses little quality while saving memory and compute.
Unstructured versus structured
- Unstructured pruning zeroes out individual weights wherever they are small. It can reach high sparsity, but the leftover pattern is irregular, so ordinary hardware rarely runs it faster without sparse kernels or dedicated hardware support.
- Structured pruning removes whole rows, heads, or layers. The result is a smaller dense model that runs faster on normal hardware, but it is coarser and can hurt accuracy more.
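
To make the contrast concrete, here is a minimal NumPy sketch. The function names prune_unstructured and prune_rows are hypothetical, and ranking rows by L2 norm is an assumed criterion chosen for illustration, not the only option.

```python
import numpy as np

def prune_unstructured(w, sparsity):
    """Zero the smallest-magnitude entries; the matrix keeps its shape."""
    k = int(sparsity * w.size)  # number of weights to zero
    if k == 0:
        return w.copy()
    # The k-th smallest absolute value serves as the cutoff.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

def prune_rows(w, keep):
    """Keep the `keep` rows (keep >= 1) with the largest L2 norm.

    Returns a smaller dense matrix plus the indices of the kept rows.
    """
    norms = np.linalg.norm(w, axis=1)
    kept = np.sort(np.argsort(norms)[-keep:])  # preserve original row order
    return w[kept], kept

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))

sparse_w = prune_unstructured(w, sparsity=0.75)  # same shape, mostly zeros
small_w, kept_rows = prune_rows(w, keep=2)       # genuinely smaller matrix
```

Both calls remove 75% of the weights, but sparse_w is still an 8 by 16 matrix that a dense kernel multiplies at full cost, while small_w is a 2 by 16 matrix that is cheaper on any hardware.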
How weights are chosen
A common signal is magnitude: small weights are assumed to be less important. Better methods score each parameter by its effect on the loss or on layer outputs, sometimes estimated from a small set of calibration data, much as calibration-based quantization methods do.
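
The two kinds of signal can disagree, as the sketch below tries to show. The output-aware score used here (weight magnitude times the norm of the matching input feature over calibration data) is one assumed choice for illustration; the function names and the tiny calibration batch are hypothetical.

```python
import numpy as np

def magnitude_scores(w):
    """Importance = absolute value of each weight."""
    return np.abs(w)

def activation_aware_scores(w, calib_x):
    """Importance = |weight| scaled by how large its input feature is.

    w has shape (out_features, in_features);
    calib_x has shape (n_samples, in_features).
    """
    feature_norms = np.linalg.norm(calib_x, axis=0)  # one norm per input feature
    return np.abs(w) * feature_norms                 # broadcasts across rows

w = np.array([[0.1, 2.0],
              [1.0, 0.5]])
# Made-up calibration batch: feature 0 is large, feature 1 is almost silent.
calib_x = np.array([[100.0, 0.01],
                    [100.0, 0.01]])

# Position (row, column) of the weight each score would prune first.
by_magnitude = np.unravel_index(np.argmin(magnitude_scores(w)), w.shape)
by_activation = np.unravel_index(
    np.argmin(activation_aware_scores(w, calib_x)), w.shape
)
```

Magnitude alone targets the 0.1 weight at position (0, 0), even though it multiplies a large input. The calibration-based score instead targets the 0.5 weight at position (1, 1), whose input barely contributes to the layer output.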
Recovering quality
After heavy pruning, a short fine-tuning pass lets the remaining weights adjust and typically recovers much of the lost accuracy, while the pruned weights stay removed.
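
A toy sketch of this recovery step, assuming a linear model on synthetic, correlated data with plain gradient descent; the data, learning rate, and step count are all made up for illustration. The point is the mask: gradients are applied only to surviving weights, so pruned positions stay at zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with correlated features, so surviving weights
# can partly compensate for the ones that were removed.
shared = rng.normal(size=(256, 1))
x = rng.normal(size=(256, 8)) + shared
true_w = rng.normal(size=8)
y = x @ true_w

def mse(w):
    return float(np.mean((x @ w - y) ** 2))

# Start from the "trained" weights and prune the 4 smallest by magnitude.
w = true_w.copy()
mask = np.ones_like(w)
mask[np.argsort(np.abs(w))[:4]] = 0.0
w *= mask
loss_pruned = mse(w)

# Fine-tune: gradient steps touch surviving weights only.
lr = 0.02
for _ in range(500):
    grad = 2.0 * x.T @ (x @ w - y) / len(y)
    w -= lr * grad * mask  # masked positions receive a zero update
loss_tuned = mse(w)
```

Comparing loss_tuned with loss_pruned shows how much of the damage the surviving weights absorb in this toy setting. The loss does not return to zero, because four of the true weights are gone for good.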
Key idea
Pruning deletes low-importance weights or structures. Structured pruning gives real speedups on common hardware, and a fine-tuning pass restores much of the lost accuracy.