Distillation for Efficiency

The goal

A large, accurate model may be too slow or too costly to serve. Knowledge distillation trains a smaller student to reproduce the behavior of a large teacher, keeping much of the quality at a fraction of the cost.

Learning from soft targets

The teacher outputs a full probability distribution, not just the top label.
These soft targets carry information about how classes relate to one another, the so-called dark knowledge.
Dividing the logits by a temperature greater than 1 before the softmax softens the distribution, so small probabilities become informative.
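
To make the effect of temperature concrete, here is a minimal sketch in plain Python. The function name softmax_with_temperature and the logit values are illustrative assumptions, not taken from any particular library or model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for a three-class problem.
teacher_logits = [8.0, 4.0, 1.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

At temperature 1 almost all of the mass sits on the top class. At temperature 4 the class ranking is unchanged, but the second and third classes receive enough probability for a student to learn from.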

The loss

The student is trained to match the teacher's distribution, usually with a KL divergence term.
It often also fits the true hard labels with a standard cross-entropy term.
A weight balances mimicking the teacher against fitting the ground truth.
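
The combined loss described above can be sketched for a single example as follows. The names distillation_loss, alpha, and temperature are illustrative assumptions. Multiplying the soft term by the squared temperature is a common convention for keeping its gradient scale comparable across temperatures; the text above does not prescribe it.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha weights the teacher term; (1 - alpha) weights the hard label."""
    # Soft term: match the teacher's temperature-softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    soft_term = kl_divergence(p_teacher, p_student_soft) * temperature ** 2

    # Hard term: ordinary cross-entropy against the true label at T = 1.
    p_student = softmax(student_logits)
    hard_term = -math.log(p_student[label])

    return alpha * soft_term + (1.0 - alpha) * hard_term

# Hypothetical logits; the true class is index 0.
loss = distillation_loss([2.0, 1.0, 0.1], [8.0, 4.0, 1.0], label=0)
print(round(loss, 4))
```

Setting alpha to 1 trains on the teacher alone, and setting it to 0 reduces to ordinary supervised training. In a real training loop this would be averaged over a batch and differentiated by the framework in use.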

Beyond logits

Feature distillation matches intermediate representations, not just outputs.
Sequence-level distillation trains a student on teacher-generated outputs for tasks like translation.
Distillation pairs well with quantization and pruning for compounding efficiency gains.
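
As a rough illustration of feature distillation, the sketch below compares one student hidden vector with one teacher hidden vector. It assumes the student's features are narrower than the teacher's, so a learned linear projection maps them into the teacher's dimension before a mean squared error is taken. The projection, the function names, and all numbers are hypothetical; other matching losses are possible.

```python
def project(features, weights):
    """Apply a (teacher_dim x student_dim) matrix to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def feature_distillation_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_features, weights)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared) / len(teacher_features)

# Hypothetical 2-dimensional student features and 3-dimensional teacher features.
student_features = [1.0, 2.0]
teacher_features = [1.0, 2.0, 3.5]
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

feature_loss = feature_distillation_loss(
    student_features, teacher_features, projection
)
print(feature_loss)
```

In practice a term like this is usually added to the output-matching loss rather than replacing it, and the projection is trained together with the student.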

Key idea

Distillation trains a small student to match a large teacher's soft-target distribution, transferring dark knowledge so the student keeps most of the quality at a much lower serving cost.
