Machine Learning

The LLM as a Judge Pattern

Using a strong model to score the outputs of another model.

What it is

LLM as a judge uses one language model to grade the output of another. Instead of a human reading every answer, you prompt a capable model with the question, the candidate answer, and a rubric, then ask for a score or a verdict.
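
As a minimal sketch of that flow, the snippet below builds a grading prompt from the question, the candidate answer, and a rubric, then parses a numeric score from the reply. The function names, the 1 to 5 scale, and the "Score: N" reply format are assumptions chosen for illustration, and `fake_model` is a stand-in for whatever model client you actually use.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(question: str, answer: str, rubric: str) -> str:
    """Assemble the three ingredients the judge needs into one prompt."""
    return (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        f"Rubric: {rubric}\n"
        "Reply with a line of the form 'Score: N', "
        "where N is an integer from 1 to 5."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the score, or return None if the reply is malformed."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def grade(question: str, answer: str, rubric: str,
          call_model: Callable[[str], str]) -> Optional[int]:
    """Send the prompt to the judge model and parse its score."""
    return parse_score(call_model(build_judge_prompt(question, answer, rubric)))


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; always returns the same reply.
    return "The answer is correct and clear.\nScore: 4"


score = grade("What is 2 + 2?", "4", "Correctness and clarity.", fake_model)
```

Returning None for a malformed reply, rather than guessing, lets the caller retry or discard that item instead of recording a wrong score.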

Common modes

  • Single answer grading: the judge rates one response against criteria like correctness and clarity.
  • Pairwise comparison: the judge sees two answers and picks the better one, which is often more reliable than absolute scores.
  • Reference based: the judge compares the answer to a known gold answer.
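
The three modes differ mainly in what the judge is shown. The sketch below makes that concrete with one hypothetical prompt builder; the mode names, the parameter names, and the prompt wording are all assumptions for illustration, not a standard format.

```python
from typing import Optional


def build_prompt(mode: str, question: str, answer: str,
                 other_answer: Optional[str] = None,
                 reference: Optional[str] = None) -> str:
    """Build a judge prompt for one of the three grading modes."""
    header = f"Question: {question}\n"
    if mode == "single":
        # Single answer grading: one response, rated against criteria.
        return (header + f"Answer: {answer}\n"
                "Rate the answer from 1 to 5 for correctness and clarity.")
    if mode == "pairwise":
        # Pairwise comparison: two responses, the judge picks one.
        if other_answer is None:
            raise ValueError("pairwise mode needs other_answer")
        return (header + f"Answer A: {answer}\nAnswer B: {other_answer}\n"
                "Which answer is better? Reply 'A' or 'B'.")
    if mode == "reference":
        # Reference based: the response is checked against a gold answer.
        if reference is None:
            raise ValueError("reference mode needs reference")
        return (header + f"Gold answer: {reference}\n"
                f"Candidate answer: {answer}\n"
                "Does the candidate agree with the gold answer? "
                "Reply 'yes' or 'no'.")
    raise ValueError(f"unknown mode: {mode}")
```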

Pitfalls

Judges are useful but biased.

  • Position bias: the judge may favor the first answer it reads, so run each comparison in both orders and average the results.
  • Verbosity bias: the judge may rate longer answers higher because they look more thorough, even when they are wrong.
  • Self preference: a judge may favor answers written by itself or in its own style.
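
One way to control for position bias is sketched below: ask the judge twice with the answers in swapped order and average the two verdicts. The judge interface (a function that returns "first" or "second") and both toy judges are assumptions for illustration; a real judge would call a model.

```python
from typing import Callable

# Assumed interface: judge(question, first_answer, second_answer)
# returns "first" or "second" to name the answer it prefers.
Judge = Callable[[str, str, str], str]


def debiased_preference(question: str, answer_a: str, answer_b: str,
                        judge: Judge) -> float:
    """Score for answer_a: 1.0 if preferred in both orders,
    0.5 if the verdict flips with the order, 0.0 if never preferred."""
    verdict_ab = judge(question, answer_a, answer_b)  # answer_a shown first
    verdict_ba = judge(question, answer_b, answer_a)  # answer_a shown second
    wins = (verdict_ab == "first") + (verdict_ba == "second")
    return wins / 2


def always_first(question: str, first: str, second: str) -> str:
    # Toy judge with pure position bias: it always picks the first answer.
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    # Toy judge that ignores position but always picks the longer answer.
    return "first" if len(first) >= len(second) else "second"


biased = debiased_preference("q", "short", "a longer answer", always_first)
by_length = debiased_preference("q", "short", "a longer answer", prefers_longer)
```

The purely position-biased judge ends up at 0.5, which correctly signals no real preference. The length-driven judge still gives a confident 0.0 for the short answer, which shows that order swapping does nothing about verbosity bias.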

Good practice is to validate the judge against a small set of human labels before trusting it at scale.
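
A minimal sketch of such a check is shown below, assuming you have judge verdicts and human labels for the same items. It reports the raw agreement rate and Cohen's kappa, which corrects agreement for chance; kappa is one common choice rather than something this lesson prescribes, and the labels here are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight items, for illustration only.
human_labels = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge_labels = ["good", "good", "bad", "good", "good", "bad", "good", "bad"]

rate = agreement_rate(judge_labels, human_labels)   # 0.875
kappa = cohens_kappa(judge_labels, human_labels)    # 0.75
```

Looking at the items where the two disagree is usually as informative as the summary numbers, since it shows which kind of answer the judge gets wrong.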

Key idea

An LLM judge scales evaluation by scoring answers against a rubric, but you must control for position and verbosity bias and check it against human labels.

Check yourself

1. Why is pairwise comparison often preferred over absolute scoring with an LLM judge?

2. What is position bias in an LLM judge?