Machine Learning

Constitutional AI

How a written set of principles lets a model critique and revise itself with less human labeling.

Reducing human labeling for safety

Constitutional AI trains a model to be harmless using a short written set of principles, called a constitution, in place of large volumes of human harm labels. The model applies these principles to supervise its own outputs.

The two phases

  • Supervised phase: the model answers prompts, then is asked to critique its own answer against a principle and revise it. Revised answers form new training data.
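  • Reinforcement phase: instead of human annotators, the model itself compares pairs of responses against the constitution to build preference data, which then supplies the reward signal for reinforcement learning. This approach is called reinforcement learning from AI feedback (RLAIF).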
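
The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not the published training pipeline: the `Model` type, the prompt wording, the two example principles, and `toy_model` (a hard-coded stand-in for a real language model) are all assumptions made so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a "model" is any function from prompt text to response text.
Model = Callable[[str], str]

# Hypothetical principles, written only for this example.
CONSTITUTION: List[str] = [
    "Prefer the response that is least likely to cause harm.",
    "Prefer the response that does not assist with dangerous activities.",
]


def critique_and_revise(model: Model, prompt: str, principle: str) -> Tuple[str, str]:
    """Supervised phase: draft, self-critique against one principle, revise."""
    draft = model(prompt)
    critique = model(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique the response using this principle: {principle}"
    )
    revision = model(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # The (prompt, revision) pair becomes one supervised training example.
    return prompt, revision


@dataclass
class Preference:
    prompt: str
    chosen: str
    rejected: str


def label_preference(judge: Model, prompt: str, a: str, b: str, principle: str) -> Preference:
    """Reinforcement phase: a model, not a human, picks the better response."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return Preference(prompt, chosen=a, rejected=b)
    return Preference(prompt, chosen=b, rejected=a)


def toy_model(text: str) -> str:
    """Hard-coded stand-in for a language model so the sketch runs end to end."""
    if "Answer A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The draft gives unsafe detail."
    return "Sure, here is how to do it."
```

With a real model in place of `toy_model`, the revised answers from `critique_and_revise` would be collected for fine-tuning, and the `Preference` records from `label_preference` would be used to train the preference model that provides the reward.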

Why principles help

  • The constitution makes the target behavior explicit and inspectable rather than buried in labels.
  • Self-critique scales cheaply, since the model generates its own improvements.
  • Principles can be revised and the model retrained, giving transparent control.

Caveats

  • The model may misapply or rationalize around principles.
  • AI-generated preferences inherit the model's existing biases.
  • Human oversight is still needed to write and audit the constitution.
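
One possible way to keep humans in the loop is to spot-check a small sample of AI-generated preference labels against human judgments. The sketch below is an assumption about how such an audit might look, not a method described in this lesson; the function names and the "A"/"B" label format are hypothetical.

```python
from typing import List


def agreement_rate(ai_labels: List[str], human_labels: List[str]) -> float:
    """Fraction of audited comparisons where the AI and the human chose the same response."""
    if len(ai_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not ai_labels:
        raise ValueError("need at least one audited comparison")
    matches = sum(a == h for a, h in zip(ai_labels, human_labels))
    return matches / len(ai_labels)


def disagreements(ai_labels: List[str], human_labels: List[str]) -> List[int]:
    """Indices of comparisons a human reviewer should look at more closely."""
    return [i for i, (a, h) in enumerate(zip(ai_labels, human_labels)) if a != h]
```

A low agreement rate, or disagreements that cluster around one principle, could signal that the model is misapplying that principle or that its wording needs revision.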

Key idea

Constitutional AI uses an explicit written set of principles to drive self-critique, revision, and AI-generated preferences, scaling harmlessness training with far less human harm labeling.

Check yourself

1. What replaces large volumes of human harm labels in Constitutional AI?

2. What happens in the supervised phase of Constitutional AI?

3. What is a caveat of AI-generated preferences?