Machine Learning

Model Pruning for LLMs

Removing weights or whole structures from a network to make it smaller and faster.

What pruning does

Pruning removes parts of a network judged to contribute little, shrinking the model. The hope is that many weights are near zero or redundant, so cutting them loses little quality while saving memory and compute.

Unstructured versus structured

  • Unstructured pruning zeroes out individual weights wherever they are small. It can reach high sparsity, but the leftover pattern is irregular, so ordinary hardware rarely runs it faster without sparse kernels or dedicated hardware support.
  • Structured pruning removes whole rows or columns, attention heads, or layers. The result is a smaller dense model that runs faster on normal hardware, but the cuts are coarser and can hurt accuracy more.
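
The difference between the two styles can be sketched on a single weight matrix. This is a minimal NumPy illustration, not a production method: the function names `prune_unstructured` and `prune_structured_rows` are hypothetical, and using row L2 norm as the structured criterion is an assumption made for the example.

```python
import numpy as np

def prune_unstructured(W, sparsity):
    """Zero the smallest-magnitude entries; the matrix keeps its shape."""
    k = int(round(sparsity * W.size))  # number of weights to zero
    if k == 0:
        return W.copy()
    # The k-th smallest absolute value serves as the cut-off.
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return W * (np.abs(W) > threshold)

def prune_structured_rows(W, keep):
    """Keep the `keep` rows with the largest L2 norm; returns a smaller dense matrix."""
    norms = np.linalg.norm(W, axis=1)
    kept = np.sort(np.argsort(norms)[-keep:])  # preserve original row order
    return W[kept], kept

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))

W_sparse = prune_unstructured(W, sparsity=0.5)
W_small, kept_rows = prune_structured_rows(W, keep=4)

print("unstructured:", W_sparse.shape, "zero fraction:", float(np.mean(W_sparse == 0)))
print("structured:  ", W_small.shape, "kept rows:", kept_rows.tolist())
```

The unstructured result has the same shape as before, with zeros scattered through it, so a dense matrix multiply does the same amount of work. The structured result is a genuinely smaller matrix, which is why it speeds up on ordinary hardware.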

How weights are chosen

A common signal is magnitude: small weights are assumed to be less important. More accurate methods score each parameter by its estimated effect on the loss or on layer outputs, often using a small set of calibration data, much like careful quantization.
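
To see how the two kinds of signal can disagree, here is a small sketch. The activation-aware score used here (weight magnitude times the norm of the matching input feature over calibration samples) is one possible output-based rule chosen for illustration; the function names and the tiny calibration matrix are made up for the example.

```python
import numpy as np

def magnitude_scores(W):
    """Importance = absolute weight value."""
    return np.abs(W)

def activation_aware_scores(W, X):
    """Importance = |weight| scaled by how large its input feature tends to be.

    W: (out_features, in_features) weights
    X: (samples, in_features) calibration inputs
    """
    input_norms = np.linalg.norm(X, axis=0)  # one norm per input feature
    return np.abs(W) * input_norms           # broadcasts across output rows

# One output neuron with two inputs: a larger weight on a tiny input,
# and a smaller weight on a large input.
W = np.array([[0.5, 0.1]])
X = np.array([[0.01, 10.0],
              [0.02, 12.0]])

mag = magnitude_scores(W)
act = activation_aware_scores(W, X)

print("magnitude would prune input", int(np.argmin(mag)))
print("activation-aware would prune input", int(np.argmin(act)))
```

Magnitude alone picks the 0.1 weight, even though it multiplies inputs around 10 and so dominates the layer output. The calibration-based score picks the 0.5 weight instead, because its input is nearly always close to zero.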

Recovering quality

After heavy pruning, a short fine-tuning pass lets the remaining weights adjust and often recovers much of the lost accuracy.
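
The recovery step can be sketched on a toy linear model. Everything here is an assumption for illustration (synthetic correlated data, plain gradient descent, a fixed learning rate); a real LLM would be fine-tuned with its usual training setup. The point shown is only that pruned weights stay at zero while the kept weights are updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with correlated features, so kept weights can partly
# compensate for the pruned ones.
n, d = 200, 10
shared = rng.normal(size=(n, 1))
X = rng.normal(size=(n, d)) + 0.8 * shared
w_dense = rng.normal(size=d)
y = X @ w_dense

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

# Prune: keep the 5 largest-magnitude weights, zero the rest.
mask = np.zeros(d, dtype=bool)
mask[np.argsort(np.abs(w_dense))[-5:]] = True
w_pruned = w_dense * mask

# Fine-tune: gradient descent on the MSE, updating only kept weights.
w_ft = w_pruned.copy()
lr = 0.05
for _ in range(300):
    grad = 2.0 / n * X.T @ (X @ w_ft - y)
    w_ft -= lr * grad * mask  # masked update keeps pruned weights at zero

loss_pruned = mse(w_pruned)
loss_finetuned = mse(w_ft)
print(f"after pruning: {loss_pruned:.4f}, after fine-tuning: {loss_finetuned:.4f}")
```

Multiplying the gradient by the mask is what preserves the sparsity pattern: without it, the pruned weights would drift away from zero during training.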

Key idea

Pruning deletes low-importance weights or structures. Structured pruning gives real speedups on common hardware, and a fine-tuning pass restores much of the lost accuracy.

Check yourself

1. Why does structured pruning give real speedups on normal hardware?

2. What is a common signal for choosing weights to prune?