Machine Learning

Constitutional AI and Self-Critique

Using written principles and model self-review to improve safety.

What it is

Constitutional AI trains a model to follow a written set of principles, called a constitution, using the model's own critiques in place of most human labeling of harmful outputs. The model learns to revise its answers so they better match the stated values.

The self-critique loop

The first phase improves responses through self-review.

  • The model produces an initial answer to a prompt.
  • It is asked to critique that answer against a principle, such as being helpful but not harmful.
  • It then revises the answer to address its own critique.
  • Each prompt is paired with its revised answer, and these pairs are used to fine-tune the model.
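
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Generate`, `RevisionExample`, `critique_and_revise`, and `toy_model` are hypothetical names introduced here, the prompt wording is an assumption, and `toy_model` is a canned stand-in for a real language model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that maps a text prompt to a text completion.
Generate = Callable[[str], str]


@dataclass
class RevisionExample:
    """One fine-tuning example: the original prompt and the revised answer."""
    prompt: str
    revised: str


def critique_and_revise(generate: Generate, prompt: str, principle: str) -> RevisionExample:
    # Step 1: the model produces an initial answer.
    initial = generate(prompt)
    # Step 2: the model critiques that answer against one principle.
    critique = generate(
        f"Prompt: {prompt}\nAnswer: {initial}\n"
        f"Critique this answer against the principle: {principle}"
    )
    # Step 3: the model revises the answer to address its own critique.
    revised = generate(
        f"Prompt: {prompt}\nAnswer: {initial}\nCritique: {critique}\n"
        "Rewrite the answer to address the critique."
    )
    # Step 4: keep the prompt with its revised answer as fine-tuning data.
    return RevisionExample(prompt=prompt, revised=revised)


def toy_model(text: str) -> str:
    """Canned stand-in for a language model, for demonstration only."""
    if "Rewrite the answer" in text:
        return "REVISED"
    if "Critique this answer" in text:
        return "CRITIQUE"
    return "INITIAL"


example = critique_and_revise(
    toy_model,
    prompt="Explain how vaccines work.",
    principle="Be helpful but not harmful.",
)
print(example)
```

Passing the model in as a plain function keeps the sketch independent of any particular library. In practice the same function would be called over many prompts, and the collected examples would form the fine-tuning set.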

Preference phase

A second phase replaces much of the human feedback in RLHF (reinforcement learning from human feedback) with AI feedback. The model compares two responses to the same prompt and judges which better follows the constitution, producing preference data automatically. That data trains a reward model or feeds a preference optimization method directly.

  • The approach scales oversight, since one principle guides many cases at once.
  • It makes values explicit and editable in the constitution.
  • Humans still write the principles and audit the results.
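
The preference phase described above can be sketched as follows. This is a rough illustration under stated assumptions: `Judge`, `label_pair`, `pairwise_loss`, and `toy_judge` are hypothetical names, the judge here is a trivial rule rather than a model, and the Bradley-Terry style loss is one common choice for training a reward model on pairs, not something the lesson specifies.

```python
import math
from typing import Callable

# Hypothetical judge signature: (prompt, response_a, response_b, principle) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def label_pair(judge: Judge, prompt: str, response_a: str,
               response_b: str, principle: str) -> dict:
    """Ask the judge which response better follows the principle."""
    choice = judge(prompt, response_a, response_b, principle)
    if choice not in ("A", "B"):
        raise ValueError(f"judge must return 'A' or 'B', got {choice!r}")
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Assumed loss: -log(sigmoid(r_chosen - r_rejected)), in a stable form."""
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_judge(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Toy stand-in for a model judge: rejects a response containing 'unsafe'."""
    return "B" if "unsafe" in response_a else "A"


pair = label_pair(
    toy_judge,
    prompt="How should I store cleaning products?",
    response_a="Here is an unsafe shortcut.",
    response_b="Keep them sealed and out of reach of children.",
    principle="Be helpful but not harmful.",
)
print(pair["chosen"])
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

The loss is small when the reward model scores the chosen response above the rejected one and grows when the order is reversed, which is what pushes the reward model toward the judge's preferences.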

Key idea

Constitutional AI combines written principles with model self-critique and revision to scale alignment, reducing reliance on human labels for harmful content.

Check yourself

1. What are the steps of the Constitutional AI self-critique loop?

2. How does Constitutional AI reduce reliance on human harm labels?