The problem
Language models are fluent, which makes wrong answers sound confident. A hallucination is a claim the model presents as fact that is false or unsupported. Factuality evaluation asks how often output is actually true.
How factuality is measured
- Closed-book QA against verified answer keys.
- Claim extraction: splitting an answer into atomic claims and checking each.
- Attribution checks: verifying that cited sources actually support the claim.
- Faithfulness checks: confirming a summary stays true to its source.
Scores often report the fraction of supported claims rather than a single right or wrong verdict.
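As a minimal sketch of that scoring, assume each answer has already been split into atomic claims and that some checker (human or automated) has marked each claim as supported or not. The `Claim` type, the `supported_fraction` helper, and the example claims are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported: bool  # verdict from a checker; assumed to be given, not computed here


def supported_fraction(claims):
    """Return the share of claims marked supported, or None if there are no claims."""
    if not claims:
        return None  # an answer with no checkable claims gets no score
    return sum(c.supported for c in claims) / len(claims)


# Illustrative claims extracted from one hypothetical answer.
answer_claims = [
    Claim("Paris is the capital of France.", True),
    Claim("The Eiffel Tower was completed in 1889.", True),
    Claim("The Eiffel Tower is 500 metres tall.", False),
]

score = supported_fraction(answer_claims)
print(f"{score:.2f}")  # prints 0.67 (2 of 3 claims supported)
```

Returning `None` for an answer with no claims is one possible design choice; it keeps refusals and empty answers from being counted as either fully correct or fully wrong.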
Grounded versus open
A grounded task gives the model a source, so a hallucination means contradicting that source or asserting something it does not support. An open task has no source, so factuality is checked against world knowledge or a reference, which is harder and noisier.
Reducing false confidence
Good evals also reward calibrated abstention: saying "I am not sure" beats inventing an answer. Penalizing confident wrong claims more heavily than honest uncertainty pushes models toward safer behavior. Retrieval and citation requirements further tie answers to checkable evidence.
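One way to express this asymmetry is a scoring rule in which a wrong answer costs more than an abstention. The sketch below is an assumption for illustration, not a standard metric: the outcome labels, the `score_response` and `mean_score` helpers, and the default penalty of 2.0 are all hypothetical.

```python
def score_response(outcome, wrong_penalty=2.0):
    """Score one response: +1 if correct, 0 if abstained, -wrong_penalty if wrong."""
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return 0.0
    if outcome == "wrong":
        return -wrong_penalty
    raise ValueError(f"unknown outcome: {outcome!r}")


def mean_score(outcomes, wrong_penalty=2.0):
    """Average the per-response scores over a non-empty list of outcomes."""
    return sum(score_response(o, wrong_penalty) for o in outcomes) / len(outcomes)


# Two hypothetical models that both know the first two answers.
guesser = ["correct", "correct", "wrong", "wrong"]        # invents the rest
abstainer = ["correct", "correct", "abstain", "abstain"]  # admits uncertainty

print(mean_score(guesser))    # prints -0.5
print(mean_score(abstainer))  # prints 0.5
```

Under this rule, answering has positive expected score only when the probability of being correct exceeds `wrong_penalty / (1 + wrong_penalty)`, which is 2/3 for the default penalty, so raising the penalty raises the confidence a model needs before it is worth answering.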
Key idea
Factuality evaluation decomposes answers into claims and checks each against evidence, rewarding supported statements and calibrated abstention while penalizing the confident fabrications that fluency makes dangerous.