The goal
A large, accurate model may be too slow or costly to serve. Knowledge distillation trains a smaller student model to reproduce the behavior of a larger teacher, keeping much of the quality at a fraction of the cost.
Learning from soft targets
- The teacher outputs a full probability distribution, not just the top label.
- These soft targets carry information about how classes relate, the so-called dark knowledge.
- A temperature greater than 1, applied to the logits before the softmax, softens the distribution so that small probabilities become informative.
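The effect of temperature can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax_with_temperature`, the example logits, and the class names in the comments are all assumptions made for the example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes, e.g. cat, dog, car.
teacher_logits = [8.0, 2.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

# At temperature 1 nearly all mass sits on the top class; at temperature 4
# the runner-up classes receive visibly more probability.
print("T=1:", [round(p, 4) for p in sharp])
print("T=4:", [round(p, 4) for p in soft])
```

The class ranking is unchanged by the temperature; only the gaps between the probabilities shrink, which is what lets the student see which wrong classes the teacher considers more plausible.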
The loss
- The student matches the teacher distribution, usually with a KL divergence term.
- It often also fits the true hard labels with a standard cross-entropy term.
- A weight balances mimicking the teacher against fitting the ground truth.
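The three bullets above can be combined into one per-example loss. The sketch below is one common formulation, not the only one: the names `distillation_loss`, `alpha`, and `temperature` are assumptions for illustration, and scaling the soft term by the squared temperature is a widely used convention for keeping its gradients comparable to the hard-label term, not something these notes prescribe.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities from logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term.

    alpha (assumed name) is the weight on the teacher-matching term;
    (1 - alpha) goes to the ground-truth term.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))

    # Standard cross-entropy against the true label, at temperature 1.
    ce = -log_softmax(student_logits, 1.0)[label]

    # The temperature**2 factor is a common convention (assumption here).
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce

# Hypothetical logits for a three-class problem; the true class is index 0.
teacher = [8.0, 2.0, 0.5]
good_student = [7.0, 2.5, 0.0]
poor_student = [0.5, 2.0, 8.0]

print(distillation_loss(good_student, teacher, label=0))
print(distillation_loss(poor_student, teacher, label=0))
```

Setting `alpha=1.0` trains on the teacher alone, and `alpha=0.0` reduces to ordinary supervised training; values in between trade one signal against the other.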
Beyond logits
- Feature distillation matches intermediate representations, not just outputs.
- Sequence-level distillation trains a student on teacher-generated outputs for tasks like translation.
- Distillation pairs well with quantization and pruning for compounding efficiency gains.
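Feature distillation, the first item in the list above, can be sketched as a mean squared error between a teacher's intermediate features and the student's features. Because the student is usually narrower, this sketch assumes a linear projection that maps student features into the teacher's dimension; the projection matrix, the function names, and the toy vectors are all hypothetical, and other matching losses are possible.

```python
def project(features, weights):
    """Apply a linear map: one output value per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def feature_distillation_loss(student_features, teacher_features, projection):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_features, projection)
    if len(projected) != len(teacher_features):
        raise ValueError("projection must map into the teacher's feature size")
    squared_errors = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical 2-dimensional student features and 3-dimensional teacher features.
student_features = [1.0, 2.0]
teacher_features = [1.0, 2.0, 5.0]

# Hypothetical 3x2 projection; in training it would be learned with the student.
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

loss = feature_distillation_loss(student_features, teacher_features, projection)
print(loss)
```

In practice a term like this is typically added to the output-level loss with its own weight, so the student is guided at an intermediate layer as well as at the final prediction.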
Key idea
Distillation trains a small student to match a large teacher's soft target distribution, transferring dark knowledge so the student keeps most of the teacher's quality at a much lower serving cost.