The Eval Harness for Safety

Making safety measurable

A safety eval harness is an automated suite that runs a fixed battery of probes against a model and scores the outputs, so safety can be tracked like any other metric across versions.

What it contains

Prompt sets spanning a harm taxonomy, plus jailbreak and injection attempts.
Graders: rules, classifiers, or an LLM judge that label each output as safe or violating.
Metrics such as violation rate, refusal correctness, and over refusal of benign prompts.

Why a harness not ad hoc checks

It is reproducible, so the same probes run on every candidate model.
It catches regressions, where a new version becomes less safe even as it gets more capable.
It lets teams gate releases on quantitative thresholds.

Design cautions

Guard against contamination: if probes leak into training, scores inflate and mean nothing.
Track both under and over refusal, since a model that refuses everything looks safe but is useless.
Refresh probes regularly, because attackers adapt and static suites go stale.
An LLM grader has its own biases, so calibrate it against human labels.

Key idea

A safety eval harness runs reproducible probe and attack sets through automated graders to track violation and over refusal rates across versions, gating releases while guarding against contamination, grader bias, and stale probes.

The Eval Harness for Safety

Making safety measurable

What it contains

Why a harness not ad hoc checks

Design cautions

Key idea

Check yourself