Measuring quality
To improve a prompt or system, you first need to measure it. An evaluation rubric defines explicit criteria and a scoring scale so that outputs are judged consistently rather than by gut feel. A good rubric turns a vague notion of quality into scores that are repeatable across raters and runs.
Building a rubric
- Name the dimensions that matter, such as accuracy, relevance, and format.
- Define each level of the scale so scorers agree on what a score of three means.
- Anchor each level with examples of high- and low-scoring outputs.
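The three steps above can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Dimension` and `Rubric` classes, the three-point `SCALE`, and the anchor sentences are hypothetical names and values chosen for the example, not a standard library or a prescribed scale.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    """One rubric dimension with a described level for every score."""
    name: str
    levels: dict[int, str]  # score -> what that score means
    anchors: dict[int, str] = field(default_factory=dict)  # score -> example output

    def check(self, score: int) -> int:
        # Reject scores that the rubric never defined.
        if score not in self.levels:
            raise ValueError(f"{self.name}: undefined score {score}")
        return score


@dataclass
class Rubric:
    dimensions: list[Dimension]

    def mean_score(self, scores: dict[str, int]) -> float:
        # Require exactly one score per dimension, then average them.
        names = {d.name for d in self.dimensions}
        if set(scores) != names:
            raise ValueError(f"expected scores for {sorted(names)}")
        checked = [d.check(scores[d.name]) for d in self.dimensions]
        return sum(checked) / len(checked)


# Hypothetical three-point scale shared by every dimension.
SCALE = {1: "fails the criterion", 2: "partly meets it", 3: "fully meets it"}

rubric = Rubric([
    Dimension("accuracy", SCALE,
              anchors={1: "Paris is in Spain.", 3: "Paris is in France."}),
    Dimension("relevance", SCALE),
    Dimension("format", SCALE),
])

overall = rubric.mean_score({"accuracy": 3, "relevance": 2, "format": 3})
print(round(overall, 2))  # 2.67
```

An unweighted mean is the simplest way to combine dimensions; if one dimension matters more than the others, a weighted mean would be a natural variation.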
Ways to score
- Human raters apply the rubric to a sample of outputs.
- Reference-based checks compare each output against a known correct answer.
- Model-as-judge scoring uses a strong model to apply the rubric at scale; it is fast, but the judge itself needs validation.
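Of the scoring methods above, a reference-based check is the easiest to automate. The sketch below assumes one simple form of it, a normalized exact match; the function names and the sample pairs are hypothetical, and real tasks often need a looser comparison than string equality.

```python
import string


def normalize(text: str) -> str:
    # Lowercase, drop ASCII punctuation, and collapse whitespace so that
    # trivial formatting differences do not count as errors.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(output: str, reference: str) -> bool:
    return normalize(output) == normalize(reference)


def match_rate(pairs: list[tuple[str, str]]) -> float:
    # Fraction of (output, reference) pairs that match after normalization.
    if not pairs:
        raise ValueError("need at least one (output, reference) pair")
    return sum(exact_match(out, ref) for out, ref in pairs) / len(pairs)


# Hypothetical (model output, reference answer) pairs.
pairs = [
    ("Paris.", "paris"),           # matches after normalization
    ("The answer is 42", "42"),    # extra words, so no exact match
    ("H2O", "h2o"),                # matches after lowercasing
]

score = match_rate(pairs)
print(round(score, 2))  # 0.67
```

The second pair shows the main limit of exact matching: a correct answer wrapped in extra words is scored as wrong, which is one reason human raters or a judge model are used alongside it.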
Pitfalls
A judge model can be biased toward verbose or confident-sounding answers and may favor outputs written in its own style. Calibrate it against human labels on a sample before trusting its scores. Keep a fixed evaluation set so that changes are compared fairly over time, and watch for evaluation examples leaking into prompts, which inflates scores.
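One way to calibrate a judge against human labels is to score the same sample with both and measure their agreement. The sketch below assumes Cohen's kappa as the agreement statistic, which is a common choice but not the only one; the label lists are made-up illustrative data, not real results.

```python
from collections import Counter


def agreement(human: list[int], judge: list[int]) -> float:
    # Raw fraction of items where the judge gave the same score as the human.
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    # Agreement corrected for the agreement expected by chance alone.
    n = len(human)
    p_observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    p_chance = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if p_chance == 1.0:
        # Both raters used a single identical label everywhere; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up scores on a three-point scale for the same eight outputs.
human_labels = [3, 3, 2, 1, 2, 3, 1, 2]
judge_labels = [3, 3, 2, 2, 2, 3, 1, 3]

raw = agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(raw)              # 0.75
print(round(kappa, 2))  # 0.61
```

Raw agreement can look high simply because one score dominates the sample, so a chance-corrected statistic gives a more cautious picture. In this made-up data both disagreements are cases where the judge scored higher than the human, which is the kind of systematic lean worth inspecting by hand.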
Key idea
An evaluation rubric defines explicit criteria and a scale so model outputs are scored consistently, and any model judge must be calibrated against human labels to be trusted.