What it is
LLM-as-a-judge uses one language model to grade the output of another. Instead of having a human read every answer, you prompt a capable model with the question, the candidate answer, and a rubric, then ask for a score or a verdict.
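A minimal sketch of that prompt-and-parse loop is shown here. It assumes the judge is any callable that takes a prompt string and returns the model's reply as text. The template wording, the 1 to 5 scale, and the names `build_judge_prompt`, `parse_score`, and `grade` are illustrative choices, not a standard API.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template: question, candidate answer, and rubric,
# followed by a strict output format that is easy to parse.
PROMPT_TEMPLATE = """You are grading an answer.

Question:
{question}

Candidate answer:
{answer}

Rubric:
{rubric}

Reply with a line of the form "Score: N" where N is an integer from 1 to 5."""


def build_judge_prompt(question: str, answer: str, rubric: str) -> str:
    """Fill the template with the three inputs the judge needs."""
    return PROMPT_TEMPLATE.format(question=question, answer=answer, rubric=rubric)


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score from the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def grade(judge: Callable[[str], str], question: str, answer: str, rubric: str) -> Optional[int]:
    """Send the prompt to the judge (assumed: prompt in, text out) and parse the score."""
    return parse_score(judge(build_judge_prompt(question, answer, rubric)))


# Usage with a stand-in judge; a real one would call a model here.
def fake_judge(prompt: str) -> str:
    return "The answer is correct but terse. Score: 4"


print(grade(fake_judge, "What is 2 + 2?", "4", "Correctness and clarity."))  # prints 4
```

Returning None for an unparseable reply, rather than a default score, keeps malformed judge output visible so it can be retried or counted separately.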
Common modes
- Single-answer grading: the judge rates one response against criteria such as correctness and clarity.
- Pairwise comparison: the judge sees two answers and picks the better one, which is often more reliable than absolute scores.
- Reference-based grading: the judge compares the answer to a known gold answer.
Pitfalls
Judges are useful but biased.
- Position bias: the judge may favor the first answer it reads, so run the comparison in both orders and average the two verdicts.
- Verbosity bias: longer answers can look more thorough even when they are wrong.
- Self-preference: a model may favor text written in its own style.
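The order-swapping fix for position bias can be sketched as follows. The sketch assumes a pairwise judge exposed as a callable that takes the question plus two answers and returns the string "first" or "second". That interface and the name `debiased_preference` are hypothetical.

```python
from typing import Callable

# Assumed judge interface: (question, first_answer, second_answer) -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairwiseJudge, question: str, answer_a: str, answer_b: str) -> float:
    """Return the fraction of the two orderings in which answer A wins.

    1.0 means A wins in both orders, 0.0 means B wins in both,
    and 0.5 means the verdict flipped with the order (treat as a tie).
    """
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # A shown second
    a_wins = (forward == "first") + (backward == "second")
    return a_wins / 2


# A judge that always picks whatever it reads first (pure position bias).
def always_first(question: str, first: str, second: str) -> str:
    return "first"


# A judge that is order-independent: it picks the longer answer.
def prefers_longer(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "q", "short", "a longer answer"))    # prints 0.5
print(debiased_preference(prefers_longer, "q", "a longer answer", "short"))  # prints 1.0
```

A purely position-biased judge collapses to 0.5, so its bias cancels out instead of counting as a win. The second stand-in judge also shows that swapping does nothing for verbosity bias, which needs a separate check.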
Good practice is to validate the judge against a small set of human labels before trusting it at scale.
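One way to run that check is to compare the judge's labels with human labels on the same items. The sketch below computes raw agreement and Cohen's kappa, which discounts the agreement expected by chance. Kappa is a common choice here, not something the text above prescribes, and the labels in the example are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[c] * human_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only, not real evaluation data.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]

print(agreement_rate(judge, human))  # prints 0.75
print(cohens_kappa(judge, human))    # prints 0.5
```

In this toy case the judge agrees with the human on three of four items, but half of that agreement would be expected by chance, so kappa is only 0.5.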
Key idea
An LLM judge scales evaluation by scoring answers against a rubric, but you must control for position and verbosity bias and validate the judge against human labels.