Machine Learning

Knowledge Distillation

Training a small student model to imitate a large teacher.

5 min read · advanced

The goal

Knowledge distillation trains a small student model to copy the behavior of a large teacher model. The aim is to keep most of the teacher's quality at a much lower inference cost.

Soft targets

The key insight is that a teacher's full probability distribution carries more information than the single correct label. These soft targets reveal how the teacher views relationships between classes, such as a cat being more like a dog than a car.
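
To make the contrast with a hard label concrete, here is a minimal sketch in plain Python. The logits are made-up illustrative values rather than output from a real model, and `softmax` is a helper defined only for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "car"]
teacher_logits = [6.0, 3.0, -2.0]  # hypothetical teacher logits for a cat image

hard_label = [1.0, 0.0, 0.0]                          # says only "cat"
soft_t1 = softmax(teacher_logits, temperature=1.0)    # teacher's raw distribution
soft_t4 = softmax(teacher_logits, temperature=4.0)    # softened distribution

for name, hard, p1, p4 in zip(classes, hard_label, soft_t1, soft_t4):
    print(f"{name}: hard={hard:.0f}  T=1: {p1:.4f}  T=4: {p4:.4f}")
```

The hard label gives dog and car the same value of zero, while the teacher's distribution puts more probability on dog than on car. Raising the temperature makes that difference easier to see without changing the ranking.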

  • The student trains on the teacher's output probabilities
  • A temperature above 1 softens those probabilities, exposing how the teacher ranks the non-target classes
  • The loss pushes the student to match the softened distribution
  • True labels are often mixed in as well
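
The steps above can be sketched as a single loss function. This is a simplified illustration in plain Python, not a reference implementation: the names `distillation_loss`, `alpha`, and `temperature`, their default values, and the example logits are all assumptions made for this sketch. Multiplying the soft term by the squared temperature is a common convention, used here so the two terms stay on a comparable scale.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (illustrative)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # Soft term: KL divergence between the softened teacher and student distributions.
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student_soft) if pt > 0)
    # Hard term: cross-entropy against the true label at temperature 1.
    hard = -math.log(softmax(student_logits)[true_index])
    # alpha is an assumed mixing weight between the two terms.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

teacher = [6.0, 3.0, -2.0]        # hypothetical teacher logits (cat, dog, car)
close_student = [5.0, 2.5, -1.5]  # roughly agrees with the teacher
far_student = [1.0, -2.0, 4.0]    # predicts "car" instead

loss_close = distillation_loss(close_student, teacher, true_index=0)
loss_far = distillation_loss(far_student, teacher, true_index=0)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

A student whose logits track the teacher's gets a lower loss than one that disagrees. Setting `alpha=1.0` drops the true-label term and trains on the teacher's softened distribution alone.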

Why it helps

By learning from the teacher's nuanced outputs, the student often beats a same-sized model trained only on hard labels. Distillation is widely used to compress large language models into smaller, faster versions for deployment, sometimes combined with quantization for an even smaller footprint.

Key idea

Distillation trains a small student to match a large teacher's softened output distribution, transferring nuanced knowledge beyond hard labels.

Check yourself


1. What does the student model learn from in distillation?

2. Why use a temperature on the teacher's outputs?