Finding failures on purpose
Red teaming is the practice of deliberately probing a model to elicit harmful, biased, or policy violating outputs, so they can be fixed before release. It treats the model as an adversary would.
How it is done
- Human red teamers craft tricky prompts across categories like violence, self harm, privacy, and deception.
- Automated red teaming uses other models to generate large numbers of attack prompts and rank which succeed.
- Findings are logged with the prompt, the harmful output, and a severity rating.
Why it matters
- Average test sets miss rare but severe failures. Red teaming targets the tails.
- Discovered failures become training data for refusals or reward model updates.
- It produces evidence for risk assessments and model cards.
Good practice
- Cover a taxonomy of harms so coverage is measurable, not ad hoc.
- Track whether fixes generalize or just patch the specific prompt.
- Re run red teaming after every major training change.
Key idea
Red teaming adversarially probes a model across a harm taxonomy, using human and automated attacks to surface rare severe failures that become training data and risk evidence.