What it is
Constitutional AI trains a model to follow a written set of principles, a constitution, using the model's own critiques instead of heavy human labeling for harms. The model learns to revise its answers to better match the stated values.
The self critique loop
The first phase improves responses through self review.
- The model produces an initial answer to a prompt.
- It is asked to critique that answer against a principle, such as being helpful but not harmful.
- It then revises the answer to address its own critique.
- The revised pairs are used to fine tune the model.
Preference phase
A second phase replaces much of the human feedback in RLHF. The model compares two responses and judges which better follows the constitution, producing preference data automatically. That data trains a reward model or feeds preference optimization.
- It scales oversight, since principles guide many cases at once.
- It makes values explicit and editable in the constitution.
- Humans still write the principles and audit the results.
Key idea
Constitutional AI uses written principles plus model self critique and revision to scale alignment, reducing reliance on human harm labels.