Constitutional AI

How a written set of principles lets a model critique and revise itself with less human labeling.

Reducing human labeling for safety

Constitutional AI trains a model to be harmless using a short written set of principles, called a constitution, instead of large volumes of human-provided harmfulness labels. The model uses these principles to supervise its own outputs.

The two phases

Supervised phase: the model answers prompts, then is asked to critique its own answer against a principle and revise it. The revised answers become training data for supervised fine-tuning.
Reinforcement phase: instead of human raters, the model itself compares pairs of responses against the constitution. The resulting preference data supplies the reward signal for reinforcement learning, an approach sometimes called RL from AI Feedback (RLAIF).
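
The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the prompt wording, the two example principles, the function names, and the toy stub_model are all hypothetical, and any real system would replace stub_model with calls to an actual language model. Drawing one principle at random per example is also just one possible choice made for this sketch.

```python
import random
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to a completion.
Model = Callable[[str], str]

# Illustrative principles only, not quoted from any real constitution.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest about its own limitations.",
]


def critique_and_revise(model: Model, prompt: str, principle: str) -> Dict[str, str]:
    """Supervised phase: draft an answer, critique it, then revise it."""
    draft = model(f"Prompt: {prompt}\nAnswer:")
    critique = model(
        f"Prompt: {prompt}\nAnswer: {draft}\n"
        f"Critique the answer against this principle: {principle}\nCritique:"
    )
    revision = model(
        f"Prompt: {prompt}\nAnswer: {draft}\nCritique: {critique}\n"
        "Rewrite the answer to address the critique.\nRevision:"
    )
    # Only the revised answer is kept as a fine-tuning target.
    return {"prompt": prompt, "completion": revision}


def build_sft_data(model: Model, prompts: List[str], rng: random.Random) -> List[Dict[str, str]]:
    """Build supervised training examples, one random principle per prompt."""
    return [critique_and_revise(model, p, rng.choice(CONSTITUTION)) for p in prompts]


def ai_preference(model: Model, prompt: str, a: str, b: str, principle: str) -> Dict[str, str]:
    """Reinforcement phase: the model, not a human, picks the better response."""
    verdict = model(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B:"
    )
    # Anything that does not start with "A" is treated as a vote for B.
    if verdict.strip().upper().startswith("A"):
        chosen, rejected = a, b
    else:
        chosen, rejected = b, a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def stub_model(text: str) -> str:
    """Toy stand-in for a language model so the sketch runs offline."""
    if text.endswith("Answer A or B:"):
        return "B"
    if text.endswith("Revision:"):
        return "I can't help with that, but here is a safer alternative."
    if text.endswith("Critique:"):
        return "The answer could enable harm."
    return "Sure, here is how to do it."


if __name__ == "__main__":
    rng = random.Random(0)
    print(build_sft_data(stub_model, ["How do I pick a lock?"], rng))
    print(ai_preference(stub_model, "How do I pick a lock?", "draft one", "draft two", CONSTITUTION[0]))
```

The supervised records pair each prompt with its revised answer, while the preference records use chosen and rejected fields, a layout assumed here because it is a common shape for preference datasets. No human label appears in either path; human input is confined to the contents of CONSTITUTION.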

Why principles help

The constitution makes the target behavior explicit and inspectable rather than buried in labels.
Self-critique scales cheaply, since the model generates its own improvements rather than waiting on human annotators.
Principles can be revised and the model retrained, giving transparent control.

Caveats

The model may misapply a principle or rationalize its way around one.
AI-generated preferences inherit the model's existing biases.
Human oversight is still needed to write and audit the constitution.

Key idea

Constitutional AI uses an explicit written set of principles to drive self-critique, revision, and AI-generated preferences, scaling harmlessness training with far fewer human-provided harmfulness labels.
