The goal
Knowledge distillation trains a small student model to reproduce the behavior of a large teacher model. The aim is to keep most of the teacher's quality while being far cheaper to run.
Soft targets
The key insight is that a teacher's full probability distribution carries more information than the single correct label. These soft targets reveal how the teacher views relationships between classes, such as a cat being more like a dog than a car.
- The student trains on the teacher's output probabilities
- A temperature above 1 softens those probabilities, exposing how the teacher ranks the incorrect classes
- The loss pushes the student to match the softened distribution
- A standard loss on the true labels is often mixed in as well
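The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the logits, the temperature of 4, and the mixing weight `alpha` are all assumptions chosen for the example. The soft term is written as a KL divergence scaled by the squared temperature, which is one common convention for keeping the two terms on a comparable scale.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (illustrative defaults)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: KL(teacher || student) on the softened distributions
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard term: cross-entropy against the true label at temperature 1
    hard = -math.log(softmax(student_logits)[true_label])
    # Scale the soft term by T^2 (assumed convention), then mix with alpha
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

# Hypothetical teacher logits for the classes [cat, dog, car]
teacher = [5.0, 3.0, -2.0]
sharp = softmax(teacher)        # roughly [0.88, 0.12, 0.001]
soft = softmax(teacher, 4.0)    # roughly [0.56, 0.34, 0.10]
loss = distillation_loss([2.0, 1.0, 0.0], teacher, true_label=0)
```

At temperature 1 the teacher's output is close to a hard label, while at temperature 4 the "dog" probability is clearly larger than the "car" probability, so the student can see which wrong class the teacher considers closer to "cat".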
Why it helps
By learning from the teacher's nuanced outputs, the student often beats a same-sized model trained only on hard labels. Distillation is widely used to compress large language models into faster versions that are cheaper to deploy, sometimes combined with quantization for an even smaller footprint.
Key idea
Distillation trains a small student to match a large teacher's softened output distribution, transferring nuanced knowledge beyond hard labels.