What safety evals check
Safety evaluation measures whether a model produces harmful content: toxic language, dangerous instructions, harassment, or disallowed material. It pairs prompts that should be refused with prompts that should be answered, then scores both.
Two error types
- Harmful compliance: answering a request that should be refused.
- Over-refusal: blocking a harmless request out of excess caution.
A good model minimizes both. Reporting only the refusal rate on harmful prompts hides over-refusal, so evals also track how often harmless prompts are wrongly refused.
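The two error rates can be computed side by side from labeled results. The following is a minimal sketch, not a standard harness: the `Result` record, its field names, and the sample data are all hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One evaluated prompt (hypothetical record layout)."""
    should_refuse: bool  # ground-truth label: True if the prompt is harmful
    refused: bool        # observed behavior: True if the model refused


def error_rates(results):
    """Return (harmful_compliance_rate, over_refusal_rate)."""
    harmful = [r for r in results if r.should_refuse]
    benign = [r for r in results if not r.should_refuse]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign prompt")
    # Harmful compliance: harmful prompts the model answered anyway.
    harmful_compliance = sum(not r.refused for r in harmful) / len(harmful)
    # Over-refusal: benign prompts the model refused.
    over_refusal = sum(r.refused for r in benign) / len(benign)
    return harmful_compliance, over_refusal


# Made-up sample: 4 harmful prompts (1 answered), 5 benign prompts (1 refused).
results = [
    Result(True, True), Result(True, False), Result(True, True), Result(True, True),
    Result(False, False), Result(False, True), Result(False, False),
    Result(False, False), Result(False, False),
]
harmful_compliance, over_refusal = error_rates(results)
print(f"harmful compliance: {harmful_compliance:.2f}, over-refusal: {over_refusal:.2f}")
```

Reporting the pair rather than a single number makes the trade-off visible: a model that refuses everything scores 0 on harmful compliance but 1 on over-refusal.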
Adversarial robustness
Static prompts understate risk because attackers adapt. Red teaming and jailbreak suites probe with role-play, obfuscation, and multi-turn pressure to see whether safety holds when challenged. A model that is safe on plain prompts can still fail under a clever framing.
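One way to make this gap visible is to break the unsafe-response rate down by attack strategy instead of reporting a single aggregate. The sketch below assumes each record is a `(strategy, was_unsafe)` pair; the strategy names and the data are invented for illustration.

```python
from collections import defaultdict


def attack_success_by_strategy(records):
    """Map each strategy to the fraction of its attempts that drew an unsafe response."""
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for strategy, was_unsafe in records:
        totals[strategy] += 1
        unsafe[strategy] += int(was_unsafe)
    return {s: unsafe[s] / totals[s] for s in totals}


# Made-up records: (strategy label, whether the response was judged unsafe).
records = [
    ("plain", False), ("plain", False),
    ("role_play", True), ("role_play", False),
    ("obfuscation", False), ("obfuscation", False),
    ("multi_turn", True), ("multi_turn", True),
]
rates = attack_success_by_strategy(records)
worst = max(rates, key=rates.get)  # strategy with the highest success rate
print(rates, "weakest against:", worst)
```

In this toy data the model is perfect on plain prompts yet fails under role-play and multi-turn pressure, which an overall average would blur.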
Scoring and limits
A classifier or LLM judge often labels each response as safe or unsafe, and its labels are calibrated against human review. Harm categories are subjective and culturally dependent, so taxonomies must be explicit. Coverage is never complete, so a passing score means safe against the tested attacks, not safe in general.
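Calibration against human review can be checked by comparing judge labels with human labels on a shared sample. The sketch below uses Cohen's kappa, a chance-corrected agreement statistic; choosing kappa here is an assumption for illustration, as are the label strings and the sample data.

```python
def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both raters labeled independently at their own base rates.
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight responses.
judge = ["unsafe", "safe", "safe", "unsafe", "safe", "safe", "safe", "unsafe"]
human = ["unsafe", "safe", "safe", "safe", "safe", "safe", "unsafe", "unsafe"]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.3f}")
```

Raw agreement here is 6 of 8, but kappa is noticeably lower because most responses are safe and two raters would often agree by chance alone. Low agreement is a signal to tighten the taxonomy or the judge prompt before trusting the scores.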
Key idea
Safety evaluation balances harmful compliance against over-refusal and stress-tests the model with adversarial red teaming, but coverage gaps mean a passing score certifies only the attacks that were tested.