Machine Learning

Distillation for Efficiency

Training a small student to mimic a large teacher and keep most of its quality.

The goal

A large, accurate model may be too slow or costly to serve. Knowledge distillation trains a smaller student to reproduce the behavior of a large teacher, keeping much of the quality at a fraction of the cost.

Learning from soft targets

  • The teacher outputs a full probability distribution, not just the top label.
  • These soft targets carry information about how classes relate, the so-called dark knowledge.
  • A temperature above 1, applied to the logits before the softmax, softens the distribution so small probabilities become informative.
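
To make the effect of temperature concrete, here is a minimal sketch in plain Python. The function name `softmax_with_temperature`, the three example logits, and the temperature values are illustrative assumptions, not part of any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes, e.g. cat, dog, truck.
teacher_logits = [8.0, 5.0, -2.0]

# temperature=1.0 is the ordinary softmax; nearly all mass sits on the top class.
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)

# A higher temperature flattens the distribution, so the runner-up classes
# receive visibly more probability while the ranking stays the same.
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 4) for p in sharp])
print("T=4:", [round(p, 4) for p in soft])
```

With the softened distribution, the student can see that the second class is a far more plausible confusion than the third, which a hard label would hide.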

The loss

  • The student matches the teacher distribution, usually with a KL divergence term.
  • It often also fits the true hard labels with a standard cross-entropy term.
  • A weighting coefficient balances mimicking the teacher against fitting the ground truth.
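
The three points above can be sketched as a single loss function for one training example. This is a sketch under stated assumptions rather than a definitive implementation: the names `distillation_loss` and `alpha`, the default values, and the multiplication of the KL term by the squared temperature (a common convention that keeps gradient magnitudes comparable across temperatures) are all choices made here for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term.

    alpha (assumed name) is the weight on the teacher-matching term;
    1 - alpha goes to the ground-truth term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(p * math.log(p / q)
             for p, q in zip(p_teacher, p_student) if p > 0)

    # Standard cross-entropy against the true label, at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Assumption: scale the soft term by temperature squared.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * hard

# Hypothetical logits; the true class is index 0.
teacher = [8.0, 5.0, -2.0]
good_student = [8.0, 5.0, -2.0]   # agrees with the teacher
bad_student = [-2.0, 5.0, 8.0]    # ranks the classes in reverse

loss_good = distillation_loss(good_student, teacher, label=0)
loss_bad = distillation_loss(bad_student, teacher, label=0)
print(loss_good, loss_bad)
```

Setting `alpha` to 1 trains on the teacher alone, and setting it to 0 recovers ordinary supervised training on the hard labels.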

Beyond logits

  • Feature distillation matches intermediate representations, not just outputs.
  • Sequence-level distillation trains a student on teacher-generated outputs for tasks like translation.
  • Distillation pairs well with quantization and pruning for compounding efficiency gains.
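
As one illustration of feature distillation, the sketch below penalizes the squared distance between a student activation and a teacher activation. It assumes the student layer is narrower than the teacher's, so a linear projection maps the student features to the teacher's width first. That projection is one common option rather than the only one, and all names and numbers here are hypothetical.

```python
def project(features, weights):
    """Linear map from student width to teacher width.

    weights has one row per teacher dimension and one column per student dimension.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def feature_distillation_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_features, weights)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared) / len(teacher_features)

# Hypothetical intermediate activations for one input.
student_h = [0.5, -1.0]          # 2-dimensional student layer
teacher_h = [1.0, 0.0, -0.5]     # 3-dimensional teacher layer

# Hypothetical projection; in practice it would be learned jointly with the student.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]

feature_loss = feature_distillation_loss(student_h, teacher_h, W)
print(feature_loss)
```

In training, a term like this would typically be added to the output-level loss with its own weight.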

Key idea

Distillation trains a small student to match a large teacher's soft-target distribution, transferring dark knowledge so the student keeps most of the quality at a much lower serving cost.

Check yourself

1. What extra information do teacher soft targets provide over hard labels?

2. What does raising the distillation temperature do?