The LLM as a Judge Pattern

What it is

LLM-as-a-judge uses one language model to grade the output of another. Instead of having a human read every answer, you prompt a capable model with the question, the candidate answer, and a rubric, then ask it for a score or a verdict.
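
As a rough illustration, the prompt-and-score loop might look like the sketch below. The 1 to 5 scale, the "Score: N" reply format, and the names build_judge_prompt, parse_score, grade, and call_model are all assumptions made for this example; call_model stands in for whatever client you use to send a prompt to the judge model and get text back.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(question: str, answer: str, rubric: str) -> str:
    """Assemble the three inputs the judge needs into one prompt."""
    return (
        "You are grading an answer.\n\n"
        f"Question:\n{question}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        f"Rubric:\n{rubric}\n\n"
        "Reply with one line of the form 'Score: N', "
        "where N is an integer from 1 to 5."
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the score, or return None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def grade(
    question: str,
    answer: str,
    rubric: str,
    call_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> Optional[int]:
    """Run one single-answer grading call through the supplied judge."""
    prompt = build_judge_prompt(question, answer, rubric)
    return parse_score(call_model(prompt))
```

Returning None for an unparseable reply, rather than guessing a score, keeps malformed judge output visible so it can be retried or counted separately.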

Common modes

Single-answer grading: the judge rates one response against criteria such as correctness and clarity.
Pairwise comparison: the judge sees two answers and picks the better one; relative judgments are often more reliable than absolute scores.
Reference-based grading: the judge compares the answer to a known gold answer.

Pitfalls

Judges are useful but biased.

Position bias: the judge may favor the first answer it reads, so run each comparison in both orders and average the results.
Verbosity bias: longer answers can look more thorough even when they are wrong.
Self-preference: a judge model may favor text written in its own style.
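
The order swap used against position bias can be sketched as follows. This is a minimal example under assumptions: judge_pair is a hypothetical callable that receives the question and two answers in display order and returns the string "first" or "second"; the function name and the 0 to 1 preference scale are choices made here for illustration.

```python
from typing import Callable


def preference_for_a(
    question: str,
    answer_a: str,
    answer_b: str,
    judge_pair: Callable[[str, str, str], str],  # hypothetical judge call
) -> float:
    """Judge the pair in both orders and average the two verdicts.

    Returns 1.0 if A wins both times, 0.0 if B wins both times, and 0.5
    if the verdict flips with the order (a sign of position bias).
    """
    verdict_ab = judge_pair(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge_pair(question, answer_b, answer_a)  # B shown first

    for verdict in (verdict_ab, verdict_ba):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")

    # A wins a pass when the judge picks whichever slot A occupied.
    a_wins = (verdict_ab == "first") + (verdict_ba == "second")
    return a_wins / 2
```

A judge that always picks the first slot ends up at 0.5 for every pair, so its position bias cancels out instead of deciding the winner.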

Good practice is to validate the judge against a small set of human labels before trusting it at scale.
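
One way to run that check is to compare the judge's labels with the human labels on the same items. The sketch below reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance; the choice of kappa and the function names are assumptions for this example, not something the pattern prescribes.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Fraction of items where the judge's label matches the human label."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, with human labels pass, fail, pass, pass and judge labels pass, fail, fail, pass, the raw agreement is 0.75 but kappa is 0.5, because half of that agreement would be expected by chance.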

Key idea

An LLM judge scales evaluation by scoring answers against a rubric, but you must control for position and verbosity bias and check the judge against human labels.
