The Red Teaming of LLMs

Finding failures on purpose

Red teaming is the practice of deliberately probing a model to elicit harmful, biased, or policy violating outputs, so they can be fixed before release. It treats the model as an adversary would.

How it is done

Human red teamers craft tricky prompts across categories like violence, self harm, privacy, and deception.
Automated red teaming uses other models to generate large numbers of attack prompts and rank which succeed.
Findings are logged with the prompt, the harmful output, and a severity rating.

Why it matters

Average test sets miss rare but severe failures. Red teaming targets the tails.
Discovered failures become training data for refusals or reward model updates.
It produces evidence for risk assessments and model cards.

Good practice

Cover a taxonomy of harms so coverage is measurable, not ad hoc.
Track whether fixes generalize or just patch the specific prompt.
Re run red teaming after every major training change.

Key idea

Red teaming adversarially probes a model across a harm taxonomy, using human and automated attacks to surface rare severe failures that become training data and risk evidence.

The Red Teaming of LLMs

Finding failures on purpose

How it is done

Why it matters

Good practice

Key idea

Check yourself