Model Pruning for LLMs

Removing weights or whole structures from a network to make it smaller and faster.

What pruning does

Pruning removes parts of a network judged to contribute little, shrinking the model. The hope is that many weights are near zero or redundant, so cutting them loses little quality while saving memory and compute.

Unstructured versus structured

Unstructured pruning zeroes out individual weights wherever they are small. It can reach high sparsity, but the leftover pattern is irregular, so general-purpose hardware rarely runs it faster unless sparse kernels or hardware sparsity support are available.
Structured pruning removes whole units: rows or columns of a weight matrix, attention heads, or entire layers. The result is a smaller dense model that runs faster on normal hardware, but the cuts are coarser and can hurt accuracy more.
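
To make the contrast concrete, here is a minimal sketch in plain Python. The function names, the toy matrix, and the choice of L2 row norm as the structured criterion are illustrative assumptions, not a reference implementation.

```python
def unstructured_prune(weights, sparsity):
    """Zero the smallest-magnitude entries; the matrix keeps its shape."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)      # number of weights to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]            # largest magnitude that gets cut
    # Ties at the threshold are also zeroed, so sparsity can slightly exceed the target.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune_rows(weights, keep):
    """Drop the rows with the smallest L2 norm; the result is a smaller dense matrix."""
    norms = [sum(w * w for w in row) ** 0.5 for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])       # preserve the original row order
    return [weights[i][:] for i in kept]


# Toy 4x3 weight matrix (hypothetical values).
W = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.08],
    [0.1, -0.2, 0.3],
]

sparse_W = unstructured_prune(W, sparsity=0.5)  # still 4x3, half the entries are zero
small_W = structured_prune_rows(W, keep=2)      # now 2x3, fully dense
```

The unstructured result has the same shape as before, so a dense matrix multiply does the same amount of work. The structured result is a genuinely smaller matrix, which is why it speeds up on ordinary hardware.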

How weights are chosen

A common signal is magnitude: small weights are assumed to be less important. Better methods score each parameter by its effect on the loss or on layer outputs, sometimes using a small set of calibration data, much as calibration-based quantization does.
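
The two kinds of signal can disagree. The sketch below compares plain magnitude with one possible output-aware score, the weight magnitude multiplied by the L2 norm of the matching input feature over a calibration set. That particular score, the function names, and the toy numbers are assumptions chosen for illustration.

```python
def magnitude_scores(weights):
    """Importance = |w|."""
    return [[abs(w) for w in row] for row in weights]


def activation_aware_scores(weights, calibration_inputs):
    """Importance = |w_ij| * ||x_j||_2, with x_j taken over the calibration inputs."""
    n_in = len(weights[0])
    col_norms = [
        sum(x[j] ** 2 for x in calibration_inputs) ** 0.5 for j in range(n_in)
    ]
    return [[abs(w) * col_norms[j] for j, w in enumerate(row)] for row in weights]


def lowest_scoring(scores):
    """Return the (row, col) position with the lowest importance score."""
    positions = ((i, j) for i, row in enumerate(scores) for j in range(len(row)))
    return min(positions, key=lambda ij: scores[ij[0]][ij[1]])


# Toy layer with 2 outputs and 2 inputs (hypothetical values).
W = [[0.10, 0.50],
     [0.40, 0.30]]

# Calibration inputs: feature 0 is large, feature 1 is tiny.
X = [[10.0, 0.1],
     [12.0, 0.2],
     [9.0, 0.1]]

by_magnitude = lowest_scoring(magnitude_scores(W))
by_activation = lowest_scoring(activation_aware_scores(W, X))
```

Magnitude alone would cut the 0.10 weight at position (0, 0). Because that weight multiplies a large input, the activation-aware score instead picks (1, 1), a larger weight attached to an input that barely moves the layer output.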

Recovering quality

After heavy pruning, a short fine-tuning pass lets the remaining weights adjust to the missing ones and often recovers much of the lost accuracy.
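
A toy version of that recovery step, assuming a two-weight linear model, mean squared error, and plain gradient descent (all hypothetical choices). The mask keeps the pruned weight at zero while the surviving weight is retrained.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def finetune(w, mask, data, lr=0.05, steps=100):
    """Gradient descent on MSE; positions where mask is 0 stay pruned."""
    w = [wi * mi for wi, mi in zip(w, mask)]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj / len(data)
        # Masked update: pruned positions receive no update and remain zero.
        w = [wi - lr * g * m for wi, g, m in zip(w, grad, mask)]
    return w


# Dense model and data it fits exactly (hypothetical values).
dense_w = [1.0, 0.3]
inputs = [(1.0, 1.0), (2.0, 1.5), (3.0, 3.5), (4.0, 4.0)]
data = [(x, predict(dense_w, x)) for x in inputs]

mask = [1, 0]                                   # prune the smaller weight
pruned_w = [wi * mi for wi, mi in zip(dense_w, mask)]

loss_before = mse(pruned_w, data)               # error right after pruning
tuned_w = finetune(dense_w, mask, data)
loss_after = mse(tuned_w, data)                 # error after masked fine-tuning
```

In this toy the two input features are correlated, so the surviving weight can absorb part of the pruned weight's job and the loss drops sharply, though not back to zero.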

Key idea

Pruning deletes low-importance weights or structures. Structured pruning gives real speedups on common hardware, and a fine-tuning pass restores much of the lost accuracy.
