Self-improvement without human labels
SL (Supervised Learning) phase: the model drafts a response, critiques it against the constitutional principles, revises it, and is then fine-tuned on the revised responses.

RL (Reinforcement Learning) phase (RLAIF): the model compares pairs of responses against the constitution to produce AI preference labels, which train a preference model used as the reward signal for RL.
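The two phases above can be sketched as a control-flow skeleton. This is a minimal illustration, not Anthropic's implementation: `stub_model`, `sl_phase`, and `ai_preference` are hypothetical names, and a canned stub stands in for real LLM calls so the loop actually runs.

```python
# Sketch of Constitutional AI's two phases with a stub model.
# All names here are illustrative, not Anthropic's API.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def stub_model(prompt: str) -> str:
    # Placeholder for a real LLM call; returns canned text so the
    # pipeline below is executable.
    if "Critique" in prompt:
        return "The response could be more careful about harm."
    if "Revise" in prompt:
        return "Here is a safer, revised response."
    return "Initial draft response."

# SL phase: generate -> self-critique -> revise; the (prompt, revision)
# pairs then serve as supervised fine-tuning targets (tuning omitted).
def sl_phase(prompt: str, principles: list[str]) -> str:
    response = stub_model(prompt)
    for principle in principles:
        critique = stub_model(f"Critique per '{principle}': {response}")
        response = stub_model(f"Revise given '{critique}': {response}")
    return response

# RL phase (RLAIF): the model itself labels which of two responses
# better follows a principle; these AI preference labels train a
# preference model used as the RL reward. Stubbed as a length check.
def ai_preference(prompt: str, a: str, b: str, principle: str) -> int:
    return 0 if len(a) >= len(b) else 1  # 0 -> prefer a, 1 -> prefer b

draft = stub_model("User question")
revision = sl_phase("User question", CONSTITUTION)
label = ai_preference("User question", revision, draft, CONSTITUTION[0])
```

The key property the sketch preserves is that no human label appears anywhere: both the revision targets and the preference labels come from the model itself, steered only by the written constitution.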
Anthropic on Twitter / X
In our paper, we describe how we’ve used Constitutional AI to train better and more harmless AI assistants without any human feedback labels for harms. This approach leads to models that are safer and also more helpful. (Anthropic, @AnthropicAI, December 16, 2022)
https://twitter.com/AnthropicAI/status/1603791168495489030
arxiv.org
https://arxiv.org/pdf/2212.08073.pdf
Constitutional AI: Harmlessness from AI Feedback
We show that language models can learn to follow a set of simple, natural language principles via self-improvement, and we use this new method to train a more harmless assistant.
https://www.anthropic.com/index/constitutional-ai-harmlessness-from-ai-feedback

Constitutional Classifiers from Anthropic, building on Constitutional AI
Heuristic rules
Constitutional Classifiers: Defending against universal jailbreaks
A paper from Anthropic describing a new way to guard LLMs against jailbreaking
https://www.anthropic.com/research/constitutional-classifiers

Constitutional AI applied to character/persona training
Context Awareness: Constitutional AI can mitigate Emergent Misalignment — LessWrong
We investigate whether Constitutional AI-style character training can increase robustness to Emergent Misalignment (EM). We take 11 character-trained…
https://www.lesswrong.com/posts/yA2hquLrFFSFDtcoE/context-awareness-constitutional-ai-can-mitigate-emergent

Seonglae Cho