← Lessons

quiz vs the machine

Gold1380

Machine Learning

The Red Teaming of LLMs

How adversarial probing surfaces harmful behaviors before users do.

5 min read · core · beat Gold to climb

Finding failures on purpose

Red teaming is the practice of deliberately probing a model to elicit harmful, biased, or policy violating outputs, so they can be fixed before release. It treats the model as an adversary would.

How it is done

  • Human red teamers craft tricky prompts across categories like violence, self harm, privacy, and deception.
  • Automated red teaming uses other models to generate large numbers of attack prompts and rank which succeed.
  • Findings are logged with the prompt, the harmful output, and a severity rating.

Why it matters

  • Average test sets miss rare but severe failures. Red teaming targets the tails.
  • Discovered failures become training data for refusals or reward model updates.
  • It produces evidence for risk assessments and model cards.

Good practice

  • Cover a taxonomy of harms so coverage is measurable, not ad hoc.
  • Track whether fixes generalize or just patch the specific prompt.
  • Re run red teaming after every major training change.

Key idea

Red teaming adversarially probes a model across a harm taxonomy, using human and automated attacks to surface rare severe failures that become training data and risk evidence.

Check yourself

Answer to earn rating on the learn ladder.

1. What is the goal of red teaming an LLM?

2. Why is a harm taxonomy useful in red teaming?