The LLM Evaluation Rubric

Measuring quality

To improve a prompt or system, you need to measure it. An evaluation rubric defines explicit criteria and a scoring scale so that outputs are judged consistently rather than by gut feel. A good rubric turns a vague notion of quality into repeatable numbers.

Building a rubric

Name the dimensions that matter, such as accuracy, relevance, and format.
Define each level of the scale so scorers agree on what a three means.
Anchor each level with examples of high- and low-scoring outputs.
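
The three steps above can be sketched as a small data structure. This is a minimal illustration, not a standard library API: the class names `Dimension` and `Rubric`, the 1 to 3 scale, the level wording, and the anchor sentences are all assumptions made for the example. Averaging the dimensions equally is also just one possible choice.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    """One rubric dimension with a description for every scale level."""
    name: str
    levels: dict[int, str]  # score -> what that score means
    anchors: dict[int, str] = field(default_factory=dict)  # score -> example output

    def validate(self, score: int) -> int:
        # Reject scores that have no written definition on the scale.
        if score not in self.levels:
            raise ValueError(f"{self.name}: {score} is not a defined level")
        return score


@dataclass
class Rubric:
    dimensions: list[Dimension]

    def score(self, ratings: dict[str, int]) -> float:
        """Average the per-dimension ratings after checking each one."""
        expected = {d.name for d in self.dimensions}
        if set(ratings) != expected:
            raise ValueError(f"ratings must cover exactly {sorted(expected)}")
        total = sum(d.validate(ratings[d.name]) for d in self.dimensions)
        return total / len(self.dimensions)


# Hypothetical 1-3 scale; the wording of each level is illustrative only.
LEVELS = {1: "fails the criterion", 2: "partly meets it", 3: "fully meets it"}

rubric = Rubric([
    Dimension(
        "accuracy",
        LEVELS,
        anchors={
            3: "Paris is the capital of France.",
            1: "Lyon is the capital of France.",
        },
    ),
    Dimension("relevance", LEVELS),
    Dimension("format", LEVELS),
])

overall = rubric.score({"accuracy": 3, "relevance": 2, "format": 1})
print(overall)
```

Writing the level definitions into the structure means a scorer cannot record a value the rubric never defined, which is one simple way to keep raters on the same scale.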

Ways to score

Human raters apply the rubric to a sample of outputs.
Reference-based checks compare each output against a known correct answer.
Model-as-judge scoring uses a strong model to apply the rubric at scale, which is fast but needs its own validation.
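
As an illustration of the reference-based option, here is a minimal sketch of an exact-match check. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and the function names are assumptions for this example. Real tasks often need a looser comparison than exact match.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(output: str, reference: str) -> bool:
    """True when output and reference agree after normalization."""
    return normalize(output) == normalize(reference)


def accuracy(outputs: list[str], references: list[str]) -> float:
    """Fraction of outputs that match their reference answer."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        raise ValueError("need at least one example")
    hits = sum(exact_match(o, r) for o, r in zip(outputs, references))
    return hits / len(outputs)


# Toy data, invented for illustration.
outputs = ["Paris.", "  berlin", "Rome"]
references = ["paris", "Berlin", "Madrid"]
result = accuracy(outputs, references)
print(result)
```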

Pitfalls

A judge model can be biased toward verbose or confident-sounding answers and may favor outputs written in its own style. Calibrate it against human labels on a sample before trusting its scores. Keep the evaluation set fixed so that changes are compared fairly over time, and watch for evaluation examples leaking into prompts, which inflates scores.
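
One way to calibrate a judge against human labels is to score the same sample with both and measure how often they agree. The sketch below computes the raw agreement rate and Cohen's kappa, which corrects agreement for chance. Kappa is one common choice here, not the only one, and the label lists are made-up data for illustration.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of items where the judge gave the same score as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently with its own label frequencies.
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up rubric scores on a 1-3 scale for the same eight outputs.
human_labels = [3, 3, 2, 1, 2, 3, 1, 2]
judge_labels = [3, 3, 3, 1, 2, 3, 2, 2]

rate = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(rate, round(kappa, 3))
```

In this toy sample both disagreements are cases where the judge scored higher than the human, the kind of systematic lean worth inspecting before relying on the judge at scale.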

Key idea

An evaluation rubric defines explicit criteria and a scale so model outputs are scored consistently, and any model judge must be calibrated against human labels to be trusted.
