Constitutional AI & Rule-Based Alignment
Constitutional AI (CAI) takes a different approach to alignment: instead of relying primarily on human evaluators to judge outputs, you give the model a set of explicit principles - a "constitution" - and train it to critique and revise its own outputs according to those principles. The model learns to ask itself questions like "Is this response helpful? Is it honest? Could it cause harm?" and to adjust accordingly.

This approach was pioneered by Anthropic and has several advantages. It's more scalable than having humans evaluate every output. It makes the alignment criteria transparent and auditable - you can read the constitution and understand what the model is trying to optimise for. And it reduces the extent to which the model's behaviour depends on the specific preferences of individual human evaluators.

Rule-based alignment more broadly encompasses any approach where explicit, written policies guide model behaviour - content policies, safety guidelines and output formatting rules. For organisations deploying AI, the constitutional approach is relevant because it offers a model for how they might specify and enforce their own standards. Rather than hoping the model's default behaviour aligns with business policies, you can work towards systems that explicitly encode those requirements.
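The critique-and-revise loop at the heart of this approach can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual training pipeline: `call_model` is a hypothetical stand-in for a real LLM API call, stubbed here so the example is self-contained, and the constitution is a toy three-principle list.

```python
# Sketch of a Constitutional AI critique-and-revise loop (illustrative only).
# `call_model` is a hypothetical placeholder for a real LLM call; it is
# stubbed with canned responses so this example runs on its own.

CONSTITUTION = [
    "Is the response helpful to the user?",
    "Is the response honest and accurate?",
    "Could the response cause harm?",
]

def call_model(prompt: str) -> str:
    # Stub: a real system would send `prompt` to a language model.
    if prompt.startswith("Critique"):
        return "The draft is terse; it should explain the risks more clearly."
    return "Revised response: a clearer, safer answer."

def critique_and_revise(draft: str, rounds: int = 1) -> str:
    """For each constitutional principle, critique the current response,
    then revise it to address the critique."""
    response = draft
    for _ in range(rounds):
        for principle in CONSTITUTION:
            critique = call_model(
                f"Critique this response against the principle "
                f"'{principle}':\n{response}"
            )
            response = call_model(
                f"Revise the response to address this critique.\n"
                f"Critique: {critique}\nResponse: {response}"
            )
    return response

print(critique_and_revise("Original draft answer."))
```

In a real CAI pipeline the revised outputs are not served directly; they become training data, so the model internalises the principles rather than running this loop at inference time.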