← Lessons

quiz vs the machine

Platinum1860

Machine Learning

The Eval Harness for Safety

How an automated test suite tracks safety regressions across model versions.

6 min read · advanced · beat Platinum to climb

Making safety measurable

A safety eval harness is an automated suite that runs a fixed battery of probes against a model and scores the outputs, so safety can be tracked like any other metric across versions.

What it contains

  • Prompt sets spanning a harm taxonomy, plus jailbreak and injection attempts.
  • Graders: rules, classifiers, or an LLM judge that label each output as safe or violating.
  • Metrics such as violation rate, refusal correctness, and over refusal of benign prompts.

Why a harness not ad hoc checks

  • It is reproducible, so the same probes run on every candidate model.
  • It catches regressions, where a new version becomes less safe even as it gets more capable.
  • It lets teams gate releases on quantitative thresholds.

Design cautions

  • Guard against contamination: if probes leak into training, scores inflate and mean nothing.
  • Track both under and over refusal, since a model that refuses everything looks safe but is useless.
  • Refresh probes regularly, because attackers adapt and static suites go stale.
  • An LLM grader has its own biases, so calibrate it against human labels.

Key idea

A safety eval harness runs reproducible probe and attack sets through automated graders to track violation and over refusal rates across versions, gating releases while guarding against contamination, grader bias, and stale probes.

Check yourself

Answer to earn rating on the learn ladder.

1. Why use a reproducible harness instead of ad hoc safety checks?

2. Why must a harness track over refusal too?

3. What is the contamination risk for a safety harness?