The LLM as a Judge

Using a strong model to grade outputs at scale, and the biases that come with it.

Automating the grader

Human evaluation is slow, so teams use a strong model as an LLM judge. The judge reads a prompt and a candidate answer (or a pair of answers to compare), then returns a score or a verdict following a rubric in its instructions. This scales grading to thousands of examples cheaply.

How it is set up

A prompt template states the criteria and the output format.
The judge may give a numeric score, a label, or a short justification.
Asking for reasoning before the verdict often improves reliability.
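
The setup above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard implementation: the template wording, the 1 to 5 scale, the `SCORE:` line format, and the names `build_judge_prompt` and `parse_score` are all assumptions made for the example. The call to the judge model itself is left out.

```python
import re

# Illustrative template: states the criteria and the output format, and asks
# for reasoning before the verdict. Wording and scale are assumptions.
JUDGE_TEMPLATE = """You are grading an answer against a rubric.

Rubric: {rubric}

Question: {question}
Answer: {answer}

First write two or three sentences of reasoning.
Then finish with a final line of the form:
SCORE: <integer from 1 to 5>"""


def build_judge_prompt(rubric: str, question: str, answer: str) -> str:
    """Fill the template with one example to be graded."""
    return JUDGE_TEMPLATE.format(rubric=rubric, question=question, answer=answer)


def parse_score(judge_output: str, low: int = 1, high: int = 5):
    """Return the score from the last 'SCORE: n' line, or None if malformed."""
    matches = re.findall(r"^SCORE:\s*(\d+)\s*$", judge_output, flags=re.MULTILINE)
    if not matches:
        return None  # the judge ignored the output format
    score = int(matches[-1])
    # Reject values outside the rubric's scale instead of clamping them.
    return score if low <= score <= high else None
```

Returning None for malformed or out-of-range output, instead of guessing, lets the caller count format failures separately from low scores.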

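When its prompt and rubric are tuned against human labels, a good judge can track human preferences closely, though how closely depends on the task and should be measured rather than assumed.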

Known biases

LLM judges carry systematic flaws:

Position bias, favoring an answer because of where it appears in the prompt (often the first slot) rather than its quality.
Verbosity bias, rewarding longer answers regardless of quality.
Self-preference, scoring outputs from its own model family higher.
Leniency, drifting toward high scores when unsure.
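
Position bias is the easiest of these to measure directly. One way to do so, assuming each pair of answers has been judged twice with the order swapped, is sketched below. The data layout and the function names `first_slot_win_rate` and `flip_rate` are hypothetical choices for this example.

```python
def first_slot_win_rate(verdicts):
    """Fraction of judgments won by whichever answer was shown first.

    verdicts: list of (winner_in_order_ab, winner_in_order_ba) pairs.
    Each winner is "A" or "B" and names the underlying answer, not the slot.
    In order AB, answer A sits in the first slot; in order BA, answer B does.
    """
    first_wins = 0
    for ab, ba in verdicts:
        first_wins += (ab == "A") + (ba == "B")
    return first_wins / (2 * len(verdicts))


def flip_rate(verdicts):
    """Fraction of pairs whose winner changes when the order is swapped."""
    return sum(ab != ba for ab, ba in verdicts) / len(verdicts)


# Toy data: four answer pairs, each judged in both orders.
verdicts = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "B")]
print(first_slot_win_rate(verdicts))  # 0.75
print(flip_rate(verdicts))            # 0.5
```

A judge that always picks the same answer in both orders scores exactly 0.5 on the first measure and 0 on the second. A first-slot win rate above 0.5 points to a first-position preference, and one below 0.5 to a second-position preference.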

Making it trustworthy

Calibrate the judge against a human-labeled set and report agreement, such as correlation for numeric scores or agreement rate for labels. Swap the answer order and average the two results to reduce position bias. Constrain the output format and pin the judge model version so scores stay comparable over time. Treat the judge as an instrument that must itself be validated.
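
Two of these steps, order swapping and calibration, can be sketched as follows. The judge signature here is hypothetical: it is assumed to return a probability in [0, 1] that the first-listed answer is better. Pearson correlation is used as one possible agreement measure; the source does not say which correlation to report, and a rank correlation may suit ordinal scores better.

```python
import math
from typing import Callable, Sequence

# Hypothetical judge signature: (prompt, first_answer, second_answer) ->
# probability in [0, 1] that the first-listed answer is better.
PairJudge = Callable[[str, str, str], float]


def debiased_preference(judge: PairJudge, prompt: str, a: str, b: str) -> float:
    """Preference for answer `a` over `b`, averaged over both orders."""
    a_shown_first = judge(prompt, a, b)          # P(a better), a in first slot
    a_shown_second = 1.0 - judge(prompt, b, a)   # P(a better), a in second slot
    return (a_shown_first + a_shown_second) / 2.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between judge scores and human scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)
```

Averaging removes a bias that adds the same amount to the first slot in both orders, but it does not help if the judge's bias depends on the content of the answers, so the calibration check against human labels is still needed.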

Key idea