Knowledge Distillation

The goal

Knowledge distillation trains a small student model to copy the behavior of a large teacher model. The aim is to keep most of the teacher's quality while costing far less to run.

Soft targets

The key insight is that a teacher's full probability distribution carries more information than the single correct label. These soft targets reveal how the teacher views relationships between classes, such as a cat being more like a dog than a car.

The student trains on the teacher's output probabilities
A temperature softens those probabilities to expose the structure
The loss pushes the student to match the softened distribution
True labels are often mixed in as well
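
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a definitive implementation: the function names, the example logits, and the defaults for temperature and alpha are all assumptions chosen for the example. Scaling the soft loss by the squared temperature is a common convention, not something the text above prescribes.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss with an ordinary hard-label loss.

    alpha (hypothetical name) weights the soft part; 1 - alpha weights
    the cross-entropy against the true label.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Push the student toward the teacher's softened distribution.
    # The temperature**2 factor is an assumed convention that keeps the
    # soft term's gradient scale comparable across temperatures.
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Mix in the true label via cross-entropy at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Made-up logits for the classes (cat, dog, car); the true label is cat.
teacher_logits = [5.0, 3.0, -2.0]
student_logits = [2.0, 0.5, 0.0]
loss = distillation_loss(student_logits, teacher_logits, true_label=0)
```

With these made-up teacher logits, raising the temperature shifts probability away from cat and toward dog and car while keeping dog well ahead of car, which is the class structure the student is meant to pick up.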

Why it helps

By learning from the teacher's nuanced outputs, the student often beats a same-sized model trained only on hard labels. Distillation is widely used to compress large language models into faster versions that are practical to deploy, sometimes combined with quantization for an even smaller footprint.

Key idea

Distillation trains a small student to match a large teacher's softened output distribution, transferring nuanced knowledge beyond hard labels.
