Machine Learning

Evaluation Of Generative Models

Measure sample quality and diversity when there is no single ground truth answer.

Evaluating a generative model is hard because there is no single correct output to compare against. A good model must produce samples that are both realistic and diverse, and these two goals can trade off against each other.

Two qualities to balance

  • Fidelity means each sample looks like real data.
  • Diversity means the samples cover the full range of the real distribution.
  • A model can score high on one while failing the other. In mode collapse, for example, the samples look realistic but come from only a small part of the real distribution.
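The gap between the two qualities can be made concrete with a small sketch. The code below is an illustrative nearest-neighbour coverage check on hand-made 2D points, not a standard library routine: the function names (`knn_radii`, `coverage`, `fidelity_and_diversity`) and the toy data are assumptions introduced here. Fidelity is estimated as the share of generated samples that land near real data, and diversity as the share of real data that has a generated sample nearby.

```python
import math

def knn_radii(points, k=1):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k=1):
    """Fraction of queries lying inside at least one k-NN ball around a support point."""
    radii = knn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)

def fidelity_and_diversity(real, generated, k=1):
    fidelity = coverage(generated, real, k)   # generated samples that sit near real data
    diversity = coverage(real, generated, k)  # real data that generated samples reach
    return fidelity, diversity

# Hypothetical toy data: the real distribution has two well-separated modes.
real = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# A collapsed model: every sample sits inside the first mode only.
collapsed = [(0.5, 0.5), (0.4, 0.5), (0.5, 0.4), (0.6, 0.6)]

fidelity, diversity = fidelity_and_diversity(real, collapsed)
print(fidelity, diversity)  # 1.0 0.0
```

Every collapsed sample is close to real data, so fidelity is perfect, yet no real point is covered by the tight cluster of generated samples, so diversity is zero. In practice such checks are usually run on learned feature vectors rather than raw coordinates.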

Common metrics

  • Inception score rewards samples that are confidently classified and varied across classes, but it never compares the generated samples against real data.
  • Fréchet Inception distance, or FID, compares the mean and covariance of network features extracted from real and generated images, and is the most widely used image metric. Lower is better.
  • Precision and recall for generative models report fidelity and diversity as two separate numbers.
  • For likelihood-based models, held-out log-likelihood gives a direct score: higher means the model assigns more probability to unseen real data.
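The first two metrics above can be sketched in a few lines of NumPy once the network outputs are available. This is a minimal sketch under stated assumptions, not a reference implementation: `inception_score` expects an array of per-sample class probabilities, `frechet_distance` expects arrays of feature vectors (rows are samples), and the random arrays at the bottom are hypothetical stand-ins for real Inception outputs. The distance follows the usual Gaussian form, ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    p_y = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

# Hypothetical stand-ins for extracted features (assumption: 4-D for brevity).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
close_feats = rng.normal(size=(500, 4))          # same distribution as real
far_feats = rng.normal(loc=3.0, size=(500, 4))   # shifted distribution

fd_close = frechet_distance(real_feats, close_feats)
fd_far = frechet_distance(real_feats, far_feats)
print(fd_close < fd_far)  # True
```

The eigenvalue route avoids a separate matrix square root routine. With real image features the covariance matrices are much larger, so the sample count needs to be large enough for the estimates to be stable.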

Watch outs

  • Metrics can be gamed and do not perfectly match human judgment.
  • Human evaluation remains a gold standard for perceptual quality, despite its cost.

Key idea

Generative models are judged on both fidelity and diversity using metrics like FID and precision and recall, but no metric is perfect, so human evaluation stays an important reference.

Check yourself

1. What two qualities should a generative model balance?

2. What does a lower Fréchet Inception distance indicate?

3. Why is human evaluation still used?