Machine Learning

The LLM Evaluation Rubric

Scoring model outputs against clear, repeatable criteria.

Measuring quality

To improve a prompt or system, you need to measure it. An evaluation rubric defines explicit criteria and a scoring scale so outputs are judged consistently rather than by gut feel. Good rubrics turn a vague sense of quality into repeatable numbers.

Building a rubric

  • Name the dimensions that matter, such as accuracy, relevance, and format.
  • Define each level of the scale so scorers agree on what a score of three means.
  • Anchor with examples of high and low scoring outputs.
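
The three steps above can be sketched as a small data structure. This is only an illustration: the `Dimension` class, the weights, the 1/3/5 scale, and the level descriptions are all assumptions made for the example, not a prescribed rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float
    levels: dict[int, str]  # score -> what that score means


# Hypothetical rubric: weights and level wording are illustrative only.
RUBRIC = [
    Dimension("accuracy", 0.5, {
        1: "contains factual errors",
        3: "minor inaccuracies",
        5: "fully correct",
    }),
    Dimension("relevance", 0.3, {
        1: "off topic",
        3: "partly addresses the question",
        5: "directly answers the question",
    }),
    Dimension("format", 0.2, {
        1: "ignores the requested format",
        3: "small format deviations",
        5: "matches the requested format",
    }),
]


def weighted_score(scores: dict[str, int], rubric=RUBRIC) -> float:
    """Combine per-dimension scores into one weighted average."""
    total_weight = sum(d.weight for d in rubric)
    total = 0.0
    for d in rubric:
        score = scores[d.name]
        # Reject scores that have no written definition in the rubric.
        if score not in d.levels:
            raise ValueError(f"{d.name}: {score} is not a defined level")
        total += d.weight * score
    return total / total_weight


overall = weighted_score({"accuracy": 5, "relevance": 3, "format": 1})
```

Rejecting undefined levels keeps scorers on the written scale. Anchor examples of high and low scoring outputs could be stored next to each level description in the same way.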

Ways to score

  • Human raters apply the rubric to a sample of outputs.
  • Reference-based checks compare each output against a known correct answer.
  • Model-as-judge scoring uses a strong model to apply the rubric at scale, which is fast but needs its own validation.
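
As a rough illustration of a reference-based check, the sketch below scores outputs by exact match after light normalization. The normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the function names are assumptions chosen for the example; real tasks often need a looser or task-specific comparison.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(output: str, reference: str) -> bool:
    """True when output and reference agree after normalization."""
    return normalize(output) == normalize(reference)


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (output, reference) pairs that match."""
    if not pairs:
        raise ValueError("no examples to score")
    return sum(exact_match(o, r) for o, r in pairs) / len(pairs)


# Hypothetical sample: one output matches its reference, one does not.
sample = [("Paris.", "paris"), ("Lyon", "Paris")]
score = accuracy(sample)
```

Exact match is strict, so it suits short factual answers better than free-form text.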

Pitfalls

A judge model can be biased toward verbose or confident answers and may favor outputs written in its own style. Calibrate it against human labels on a sample before trusting its scores. Keep a fixed evaluation set so you compare changes fairly over time, and watch for evaluation examples leaking into prompts, which inflates scores.
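
One way to calibrate a judge against human labels is to measure how often the two agree on the same sample, and to correct that figure for chance with Cohen's kappa. The sketch below assumes both sets of labels are already collected as equal-length lists; the function names and the sample labels are hypothetical.

```python
from collections import Counter


def agreement_rate(human: list, judge: list) -> float:
    """Share of items where the judge gives the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list, judge: list) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts = Counter(human)
    j_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own label frequencies.
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treated as full
        # agreement here by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels (1 = pass, 0 = fail) on four outputs.
human_labels = [1, 1, 0, 0]
judge_labels = [1, 0, 0, 0]
raw = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

Raw agreement can look high when one label dominates, which is why a chance-corrected figure is often reported alongside it.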

Key idea

An evaluation rubric defines explicit criteria and a scale so model outputs are scored consistently, and any model judge must be calibrated against human labels to be trusted.

Check yourself

1. What is the purpose of an evaluation rubric?

2. When using a model as judge, you should

3. Why keep a fixed evaluation set?