Evaluation Of Generative Models

Measure sample quality and diversity when there is no single ground truth answer.

Evaluating a generative model is hard because there is no single correct output. A good model must produce samples that are both realistic and diverse, and these two goals can trade off against each other.

Two qualities to balance

Fidelity means each sample looks like real data.
Diversity means the samples cover the full range of the real distribution.
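A model can score high on one while failing the other. Under mode collapse, for example, every sample looks realistic but the samples come from only a few modes of the real distribution.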
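The gap between the two qualities can be seen in a toy example. The sketch below is only an illustration under stated assumptions: the one-dimensional data, the two modes at -3 and +3, the `radius` value, and the helper `fraction_near` are all hypothetical choices made for this example, not a standard metric implementation.

```python
import random

def fraction_near(queries, references, radius):
    """Fraction of `queries` lying within `radius` of at least one reference point."""
    hits = sum(1 for q in queries if any(abs(q - r) <= radius for r in references))
    return hits / len(queries)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Real data: two equally likely modes centred at -3 and +3 (toy assumption).
real = [rng.gauss(-3.0, 0.5) for _ in range(500)] + [rng.gauss(3.0, 0.5) for _ in range(500)]

# Collapsed model: realistic samples, but only from the mode at +3.
generated = [rng.gauss(3.0, 0.5) for _ in range(1000)]

radius = 0.2  # hypothetical neighbourhood size
fidelity = fraction_near(generated, real, radius)   # do generated samples look real?
diversity = fraction_near(real, generated, radius)  # are real samples covered?
print(f"fidelity={fidelity:.2f} diversity={diversity:.2f}")
```

In this setup the fidelity score should come out close to 1 while the diversity score stays at or below one half, because the real samples around -3 are never covered.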

Common metrics

Inception score rewards samples that are confidently classified and varied across classes, but it never compares generated samples against real data.
Fréchet inception distance, or FID, compares the mean and covariance of Inception network features for real and generated images and is the most widely used image metric. Lower is better.
Precision and recall for generative models separate fidelity from diversity into two numbers.
For likelihood-based models, held-out log-likelihood gives a direct score.
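To make the FID entry above concrete, here is a minimal sketch of the underlying Fréchet distance between two Gaussians fitted to feature sets. It assumes the features have already been extracted. In practice they come from an Inception network, while the random arrays here are only stand-ins, and the function name `frechet_distance` is chosen for this example.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (n_samples, n_features).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr(sqrt(cov_r @ cov_g)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))     # stand-in for real features
close = rng.normal(0.0, 1.0, size=(2000, 8))    # same distribution as `real`
shifted = rng.normal(1.0, 1.0, size=(2000, 8))  # mean shifted by 1 in every dimension

fid_close = frechet_distance(real, close)
fid_shifted = frechet_distance(real, shifted)
print(fid_close, fid_shifted)
```

The set drawn from the same distribution should score much lower than the shifted one, which matches the "lower is better" reading. Real FID pipelines use far higher-dimensional features and typically many more samples, so this toy run only shows the shape of the calculation.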

Watch outs

Metrics can be gamed and do not perfectly match human judgment.
Human evaluation remains a gold standard for perceptual quality, despite its cost.

Key idea

Generative models are judged on both fidelity and diversity using metrics like FID and precision and recall, but no metric is perfect, so human evaluation stays an important reference.
