Automating the grader
Human evaluation is slow, so teams use a strong model as an LLM judge. The judge reads a prompt and a candidate answer, then returns a score or a verdict according to a rubric given in its instructions. This scales grading to thousands of examples cheaply.
How it is set up
- A prompt template states the criteria and the output format.
- The judge may return a numeric score or a label, optionally with a short justification.
- Asking for reasoning before the verdict often improves reliability.
When tuned against human labels, a good judge can track human preference closely.
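The setup above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the template wording, the 1 to 5 scale, the "SCORE:" output line, and the names build_judge_prompt and parse_score are all assumptions made for the example. The call to the judge model itself is left out because it depends on the provider.

```python
import re

# Hypothetical template: states the criteria, asks for reasoning first,
# and fixes the output format so the verdict can be parsed reliably.
JUDGE_TEMPLATE = """You are grading an answer against a rubric.

Rubric: {rubric}

Question: {question}
Answer: {answer}

First explain your reasoning in a few sentences.
Then finish with a final line of the form: SCORE: <integer from 1 to 5>"""


def build_judge_prompt(rubric, question, answer):
    """Fill the template with one example to be graded."""
    return JUDGE_TEMPLATE.format(rubric=rubric, question=question, answer=answer)


# Accept only a line that consists of "SCORE:" followed by a single digit 1-5.
_SCORE_RE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_score(judge_output):
    """Return the last well-formed score in the judge's reply, or None."""
    matches = _SCORE_RE.findall(judge_output)
    return int(matches[-1]) if matches else None
```

Returning None for a malformed reply, rather than guessing a score, lets the caller retry or count format failures separately from low scores.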
Known biases
LLM judges carry systematic flaws:
- Position bias, favoring an answer because of its position in a pairwise comparison, for example whichever appears first.
- Verbosity bias, rewarding longer answers regardless of quality.
- Self-preference, scoring outputs from its own model family higher.
- Leniency, drifting toward high scores when unsure.
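Position bias in particular is easy to measure. The sketch below assumes a hypothetical pairwise judge, passed in as a function judge(first, second) that returns "A" if it prefers the answer in the first slot and "B" if it prefers the second. It runs every pair in both orders and reports how often the preferred answer changes. The function name and the "A"/"B" convention are illustrative assumptions.

```python
def position_flip_rate(judge, pairs):
    """Fraction of pairs whose winner changes when the answer order is swapped.

    judge(first, second) must return "A" (first slot wins) or "B" (second
    slot wins). A judge with no position bias picks the same answer in both
    orders, so the slot label it returns changes when the order is swapped.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for a, b in pairs:
        forward = judge(a, b)   # a in the first slot
        backward = judge(b, a)  # b in the first slot
        # Same slot label in both orders means the chosen answer flipped.
        if forward == backward:
            flips += 1
    return flips / len(pairs)
```

A rate near 0 suggests the verdicts depend on the answers rather than their order. A rate near 1 means the judge is mostly choosing a slot.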
Making it trustworthy
Calibrate the judge against a human-labeled set and report the correlation between judge scores and human scores. In pairwise comparisons, run each pair in both orders and average the results to cancel position bias. Constrain the output format and pin the judge model version so scores stay stable over time. Treat the judge as an instrument that must itself be validated.
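Two of these steps are simple enough to sketch. The code below assumes a hypothetical judge_score(first, second) function that returns a number between 0 and 1 expressing how strongly the judge prefers the first-slot answer. It averages over both orders, then uses a plain Pearson correlation to compare judge scores with human scores. All names are illustrative, and Pearson is only one possible agreement measure. A rank correlation or an agreement rate may suit ordinal labels better.

```python
from math import sqrt


def debiased_preference(judge_score, answer_a, answer_b):
    """Preference for answer_a, averaged over both presentation orders.

    judge_score(first, second) returns the judge's preference for the
    first-slot answer in [0, 1]. A constant bonus for one slot appears with
    opposite sign in the two terms, so it cancels in the average.
    """
    a_first = judge_score(answer_a, answer_b)          # preference for a
    a_second = 1.0 - judge_score(answer_b, answer_a)   # still preference for a
    return (a_first + a_second) / 2.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

In use, judge_scores and human_scores would be parallel lists over the same human-labeled examples, and pearson(judge_scores, human_scores) would be reported alongside the pinned judge model version.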
Key idea
An LLM judge scales grading cheaply and can track human preference, but only after you correct for position, verbosity, and self-preference biases and validate it against human labels.