The Safety and Toxicity Eval

Measuring harmful output and how well a model's safeguards hold under adversarial pressure.

What safety evals check

Safety evaluation measures whether a model produces harmful content: toxic language, dangerous instructions, harassment, or disallowed material. It pairs prompts that should be refused with prompts that should be answered, then scores both.

Two error types

Harmful compliance: answering a request that should have been refused.
Over-refusal: blocking a harmless request out of excess caution.

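A good model minimizes both. The two trade off against each other: a model that refuses everything never complies with a harmful request, yet it is useless. Reporting only the refusal rate on harmful prompts hides over-refusal, so evals also track how often harmless prompts are wrongly refused.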
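As a rough illustration, both error rates can be computed from the same set of labeled results. The sketch below assumes each record carries a ground-truth flag for whether the prompt should be refused and a flag for whether the model refused; the names EvalRecord and error_rates are hypothetical, not part of any particular eval framework.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class EvalRecord:
    """One eval result (hypothetical structure for illustration)."""
    prompt_id: str
    should_refuse: bool  # ground-truth label from the eval set
    refused: bool        # what the model actually did


def error_rates(records: Iterable[EvalRecord]) -> Dict[str, float]:
    """Return harmful-compliance and over-refusal rates for a result set."""
    records = list(records)
    harmful = [r for r in records if r.should_refuse]
    benign = [r for r in records if not r.should_refuse]

    # Harmful compliance: prompts that should be refused but were answered.
    harmful_compliance = (
        sum(not r.refused for r in harmful) / len(harmful)
        if harmful else float("nan")
    )
    # Over-refusal: harmless prompts that were refused anyway.
    over_refusal = (
        sum(r.refused for r in benign) / len(benign)
        if benign else float("nan")
    )
    return {
        "harmful_compliance_rate": harmful_compliance,
        "over_refusal_rate": over_refusal,
    }
```

Reporting the two numbers side by side makes the trade-off visible: a change that lowers one rate while raising the other shows up immediately instead of looking like a pure improvement.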

Adversarial robustness

Static prompts understate risk because attackers adapt. Red teaming and jailbreak suites probe with role play, obfuscation, and multi-turn pressure to see whether safety holds when challenged. A model that is safe on plain prompts can still fail under a clever framing.
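One way to make this concrete is to run each harmful request under several framings and compare success rates. The sketch below assumes results arrive as (prompt_id, style, unsafe) tuples, where style is a label such as "plain" or "role_play" and unsafe is the scorer's verdict; the tuple layout and function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Result = Tuple[str, str, bool]  # (prompt_id, style, unsafe), hypothetical layout


def attack_success_by_style(results: Iterable[Result]) -> Dict[str, float]:
    """Fraction of attempts judged unsafe, grouped by attack style."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for _prompt_id, style, unsafe in results:
        totals[style] += 1
        hits[style] += int(unsafe)
    return {style: hits[style] / totals[style] for style in totals}


def worst_case_rate(results: Iterable[Result]) -> float:
    """Fraction of prompts broken by at least one framing."""
    broken: Dict[str, bool] = defaultdict(bool)
    for prompt_id, _style, unsafe in results:
        broken[prompt_id] = broken[prompt_id] or unsafe
    if not broken:
        return float("nan")
    return sum(broken.values()) / len(broken)
```

The worst-case figure is usually the more honest one to report, since an attacker only needs one framing to work. A model can score 0.0 on the "plain" style and still have a high worst-case rate.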

Scoring and limits

A classifier or LLM judge often labels each response as safe or unsafe, calibrated against human review. Categories are subjective and culturally dependent, so taxonomies must be explicit. Coverage is never complete, so a passing score means safe against the tested attacks, not safe in general.
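Calibrating a judge against human review can be as simple as comparing labels on a shared sample. The sketch below assumes binary labels (True meaning unsafe) and reports raw agreement, Cohen's kappa, and the share of human-flagged unsafe responses that the judge missed. The function name and the choice of these three metrics are illustrative assumptions, not a standard protocol.

```python
from typing import Dict, Sequence


def judge_vs_human(judge: Sequence[bool], human: Sequence[bool]) -> Dict[str, float]:
    """Compare judge labels to human labels; True means 'unsafe'."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("label lists must be non-empty and equal in length")

    n = len(judge)
    pairs = list(zip(judge, human))
    tp = sum(j and h for j, h in pairs)              # both say unsafe
    tn = sum((not j) and (not h) for j, h in pairs)  # both say safe
    fp = sum(j and (not h) for j, h in pairs)        # judge flags, human does not
    fn = sum((not j) and h for j, h in pairs)        # judge misses an unsafe response

    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's marginal label rates.
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")
    missed_unsafe = fn / (tp + fn) if (tp + fn) else float("nan")

    return {
        "agreement": observed,
        "cohen_kappa": kappa,
        "missed_unsafe_rate": missed_unsafe,
    }
```

Raw agreement can look high when most responses are safe, which is why a chance-corrected measure such as kappa is reported alongside it. The missed-unsafe rate matters most for safety, because those are the failures a passing score would hide.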

Key idea