Evaluation of Generative Models
Evaluating a generative model is hard because there is no single correct output. A good model must produce samples that are both realistic and diverse, and these two goals can trade off against each other.
Two qualities to balance
- Fidelity means each sample looks like real data.
- Diversity means the samples cover the full range of the real distribution.
- A model can score high on one while failing the other. In mode collapse, for example, the samples look realistic but cover only a few modes of the real distribution.
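The mode-collapse case can be made concrete with a toy check. The sketch below is only an illustration under stated assumptions: the data are synthetic 2D points, the helper names `nearest_distances` and `fidelity_and_coverage` are hypothetical, and the fixed `radius` threshold is an arbitrary choice rather than part of any standard metric.

```python
import numpy as np

def nearest_distances(queries, references):
    """Distance from each query point to its closest reference point."""
    diffs = queries[:, None, :] - references[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def fidelity_and_coverage(real, generated, radius):
    # Fidelity proxy: share of generated samples lying close to some real sample.
    fidelity = float((nearest_distances(generated, real) <= radius).mean())
    # Diversity proxy: share of real samples that have a generated sample nearby.
    coverage = float((nearest_distances(real, generated) <= radius).mean())
    return fidelity, coverage

rng = np.random.default_rng(0)

# Toy "real" data (an assumption for illustration): two well-separated modes.
real = np.concatenate([
    rng.normal(loc=-5.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(200, 2)),
])

# A collapsed generator that only ever produces the first mode.
collapsed = rng.normal(loc=-5.0, scale=0.5, size=(400, 2))

fidelity, coverage = fidelity_and_coverage(real, collapsed, radius=1.0)
print(f"fidelity proxy: {fidelity:.2f}, coverage proxy: {coverage:.2f}")
```

In this setup almost every generated point sits near a real one, so the fidelity proxy is close to 1, while the coverage proxy cannot exceed one half because the second mode is never generated.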
Common metrics
- Inception score (IS) rewards samples that a pretrained classifier labels confidently and that are varied across classes, but it never compares generated samples with the real data. Higher is better.
- Fréchet inception distance, or FID, compares the mean and covariance of features extracted from real and generated images by a pretrained Inception network, and is the most widely used image metric. Lower is better.
- Precision and recall for generative models report fidelity and diversity as two separate numbers: precision for fidelity, recall for diversity.
- For likelihood-based models, held-out log-likelihood gives a direct score.
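The first two metrics in this list can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: it assumes the classifier probabilities and the feature vectors have already been computed (in practice by a pretrained Inception network), and it uses random arrays as stand-ins for them. The function names `inception_score` and `frechet_distance` are hypothetical.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    marginal = class_probs.mean(axis=0, keepdims=True)
    log_ratio = np.log(class_probs + eps) - np.log(marginal + eps)
    kl = (class_probs * log_ratio).sum(axis=1)
    return float(np.exp(kl.mean()))

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative when
    # both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_r @ cov_g).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = ((mu_r - mu_g) ** 2).sum()
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)

# Stand-in features (an assumption): 500 samples of 8-dimensional vectors.
feats_real = rng.normal(size=(500, 8))
feats_shifted = feats_real + 3.0  # same spread, different mean

fid_same = frechet_distance(feats_real, feats_real)
fid_shifted = frechet_distance(feats_real, feats_shifted)

# Stand-in classifier outputs: confident and varied vs. uninformative.
is_confident = inception_score(np.eye(4))
is_uniform = inception_score(np.full((4, 4), 0.25))
```

With these stand-ins, the distance between a feature set and itself is zero up to rounding and grows once the mean shifts, while the score is highest when predictions are confident and spread evenly across classes. Neither function looks at images directly, so the quality of both metrics depends on the feature extractor that is assumed here.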
Watch-outs
- Metrics can be gamed and do not perfectly match human judgment.
- Human evaluation remains a gold standard for perceptual quality, despite its cost.
Key idea
Generative models are judged on both fidelity and diversity using metrics like FID and precision and recall, but no metric is perfect, so human evaluation stays an important reference.