Machine Learning

The LLM as a Judge

Using a strong model to grade outputs at scale, and the biases that come with it.

Automating the grader

Human evaluation is slow, so teams often use a strong model as the grader, an approach known as LLM-as-a-judge. The judge reads a prompt and a candidate answer, then returns a score or a verdict according to a rubric given in its instructions. This makes it practical to grade thousands of examples at low cost.

How it is set up

  • A prompt template states the criteria and the output format.
  • The judge may give a numeric score, a label, or a short justification.
  • Asking for reasoning before the verdict often improves reliability.
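The setup above can be sketched as a template plus a strict parser. This is a minimal illustration, not a standard API: the template wording, the 1 to 5 scale, the `SCORE:` line format, and the function names `build_judge_prompt` and `parse_score` are all assumptions chosen for the example. The template asks for the justification first and the verdict last, and the parser rejects anything that does not match the requested format.

```python
import re

# Hypothetical template: criteria first, then an explicit output format.
JUDGE_TEMPLATE = """You are grading an answer against a rubric.

Rubric: {rubric}

Question: {question}
Answer: {answer}

First write a short justification, then finish with a final line
in exactly this format:
SCORE: <integer from 1 to 5>"""


def build_judge_prompt(rubric: str, question: str, answer: str) -> str:
    """Fill the template with one example to be graded."""
    return JUDGE_TEMPLATE.format(rubric=rubric, question=question, answer=answer)


def parse_score(judge_output: str, low: int = 1, high: int = 5):
    """Return the score from the last 'SCORE: n' line, or None if it is
    missing or outside the allowed range."""
    matches = re.findall(r"^SCORE:\s*(\d+)\s*$", judge_output, flags=re.MULTILINE)
    if not matches:
        return None
    score = int(matches[-1])
    return score if low <= score <= high else None
```

Returning `None` for malformed output, rather than guessing, keeps unparseable verdicts visible so they can be counted or retried.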

When tuned against human labels, a good judge can track human preference closely.
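One common way to quantify how closely a judge tracks human labels is a rank correlation between the two sets of scores. The sketch below computes Spearman's rank correlation in plain Python; the choice of Spearman is an assumption made for illustration (the lesson does not name a metric), and the function names are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(ranks(judge_scores), ranks(human_scores))
```

A value near 1 means the judge orders examples much as the human raters do, even if its absolute scores sit on a different scale.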

Known biases

LLM judges carry systematic flaws:

  • Position bias, favoring an answer because of where it appears, typically whichever comes first.
  • Verbosity bias, rewarding longer answers regardless of quality.
  • Self-preference, scoring outputs from its own model family higher.
  • Leniency, drifting toward high scores when unsure.
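Biases like these can be measured before they are corrected. As one example, the sketch below probes position bias by showing every pair in both orders and counting how often the judge picks the first slot. The `judge(question, a, b)` callable returning `"A"` or `"B"` is a hypothetical interface, and the two stub judges stand in for real model calls.

```python
def first_slot_rate(judge, pairs):
    """Fraction of verdicts that pick slot 'A' when each pair is shown in both orders.

    Each answer occupies slot A exactly once, so a judge that ignores order
    and never ties lands at 0.5; values well above 0.5 suggest first-position bias.
    """
    first_picks = 0
    total = 0
    for question, ans_1, ans_2 in pairs:
        for a, b in ((ans_1, ans_2), (ans_2, ans_1)):
            verdict = judge(question, a, b)  # assumed to return "A" or "B"
            if verdict in ("A", "B"):
                total += 1
                first_picks += verdict == "A"
    return first_picks / total if total else float("nan")


# Stub judges for illustration only; a real judge would call a model.
def always_first(question, a, b):
    return "A"


def prefers_longer(question, a, b):
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "medium length", "tiny"),
]
rate_biased = first_slot_rate(always_first, pairs)
rate_order_free = first_slot_rate(prefers_longer, pairs)
```

Note that `prefers_longer` passes this probe while still showing verbosity bias, so each bias needs its own check.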

Making it trustworthy

Calibrate the judge against a human-labeled set and report its correlation with those labels. Run each pairwise comparison in both answer orders and average the results to reduce position bias. Constrain the output format and pin the judge model version so scores stay comparable over time. Treat the judge as a measurement instrument that must itself be validated.
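The order-swapping step can be sketched as follows. The interface is an assumption for illustration: `judge(question, a, b)` is taken to return a number in [0, 1] expressing how strongly the answer in slot A is preferred, and `biased_judge` is a toy stand-in with made-up quality values and a fixed bonus for slot A.

```python
def debiased_preference(judge, question, ans_1, ans_2):
    """Average preference for ans_1 over both presentation orders."""
    forward = judge(question, ans_1, ans_2)         # ans_1 shown in slot A
    backward = 1.0 - judge(question, ans_2, ans_1)  # ans_1 shown in slot B
    return (forward + backward) / 2


# Toy judge: hypothetical "true" qualities plus a constant bonus for slot A.
QUALITY = {"x": 0.6, "y": 0.4}
SLOT_A_BONUS = 0.2


def biased_judge(question, a, b):
    raw = 0.5 + (QUALITY[a] - QUALITY[b]) / 2 + SLOT_A_BONUS
    return min(1.0, max(0.0, raw))


single_pass = biased_judge("q", "x", "y")                 # inflated by the slot bonus
averaged = debiased_preference(biased_judge, "q", "x", "y")
unbiased = 0.5 + (QUALITY["x"] - QUALITY["y"]) / 2        # what a bias-free judge would say
```

In this toy case the bonus is constant, so averaging removes it exactly. A real judge's position bias may vary with the inputs, in which case averaging reduces the effect rather than eliminating it.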

Key idea

An LLM judge scales grading cheaply and can track human preference, but only after you correct for position, verbosity, and self-preference biases and validate it against human labels.

Check yourself

1. Which bias makes an LLM judge favor whichever answer is shown first?

2. How can position bias be reduced when using an LLM judge?

3. Why must an LLM judge be validated against human labels?