Constitutional AI

Creator: Seonglae Cho
Created: 2023 Nov 25 15:30
Editor: Seonglae Cho
Edited: 2025 Mar 21 11:57
Refs: RLAIF
Self-improvement without human labels: Constitutional AI trains a more harmless assistant without any human feedback labels for harms, using a written set of principles (a constitution) and AI feedback instead.

SL (Supervised Learning) phase

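A minimal sketch of the SL phase, assuming a generic `generate(prompt)` text-completion helper and illustrative principles (both are assumptions, not the paper's exact prompts): the model drafts a response to a red-team prompt, critiques it against a sampled constitutional principle, revises it, and the revised responses form the supervised fine-tuning set.

```python
import random

# Illustrative constitutional principles (assumed wording, not Anthropic's exact constitution).
CONSTITUTION = [
    "Choose the response that is least harmful and does not help with dangerous activities.",
    "Choose the response that is most helpful, honest, and harmless.",
]

def critique_and_revise(generate, prompt: str, n_rounds: int = 2) -> str:
    """SL phase: draft -> self-critique -> revision, repeated for a few rounds."""
    response = generate(prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Critique the response below according to this principle:\n{principle}\n\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique:"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\nRevision:"
        )
    return response

def build_sl_dataset(generate, red_team_prompts: list[str]) -> list[dict]:
    """Collect (prompt, revised response) pairs used to fine-tune the model."""
    return [{"prompt": p, "response": critique_and_revise(generate, p)}
            for p in red_team_prompts]
```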
 
 

RL (Reinforcement Learning) phase
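A minimal sketch of the RL phase (RLAIF), under the same assumptions as the SL sketch: the SL-trained model samples two responses per prompt, a feedback model picks the one that better follows a sampled principle, and the AI-labeled comparisons train a preference model whose score becomes the reward for RL. `train_preference_model` and the PPO loop mentioned in the comments are placeholders, not a real API.

```python
import random

def ai_preference_label(generate, constitution: list[str], prompt: str,
                        response_a: str, response_b: str) -> int:
    """Feedback model picks the response that better follows a sampled principle.
    Returns 0 if (A) is preferred, 1 if (B) is preferred."""
    principle = random.choice(constitution)
    answer = generate(
        f"Consider this principle:\n{principle}\n\n"
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Which response better follows the principle? Answer (A) or (B):"
    )
    return 0 if "(A)" in answer else 1

def build_preference_dataset(generate, constitution: list[str],
                             prompts: list[str]) -> list[dict]:
    """AI-labeled comparisons used to train the preference (reward) model."""
    data = []
    for p in prompts:
        a, b = generate(p), generate(p)  # two samples from the SL-trained policy
        winner = ai_preference_label(generate, constitution, p, a, b)
        data.append({"prompt": p, "chosen": [a, b][winner], "rejected": [a, b][1 - winner]})
    return data

# A preference model trained on this data supplies the reward for RL against the policy,
# e.g. reward_model = train_preference_model(data), followed by PPO on the SL model
# (these names are placeholders for whatever RLHF-style stack is used).
```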

 
 
 
 
Anthropic on Twitter / X
"In our paper, we describe how we’ve used Constitutional AI to train better and more harmless AI assistants without any human feedback labels for harms. This approach leads to models that are safer and also more helpful." (Anthropic, @AnthropicAI, December 16, 2022)
https://twitter.com/AnthropicAI/status/1603791168495489030
Constitutional AI: Harmlessness from AI Feedback
"We show that language models can learn to follow a set of simple, natural language principles via self-improvement, and we use this new method to train a more harmless assistant."
https://arxiv.org/pdf/2212.08073.pdf
https://www.anthropic.com/index/constitutional-ai-harmlessness-from-ai-feedback

Constitutional Classifiers from Anthropic build on Constitutional AI: heuristic, constitution-derived rules guide classifiers that screen model inputs and outputs to defend against universal jailbreaks, as sketched below.
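A rough sketch of the guarding pattern, with a trivial keyword stand-in for the learned classifiers (the actual system trains classifiers on constitution-generated synthetic data rather than matching phrases): both the user input and the model output are screened before a response is returned.

```python
# Illustrative screening phrases (assumed); real classifiers are trained models, not keyword matchers.
BLOCKLIST = ("synthesize the nerve agent", "step-by-step bioweapon")

def flagged(text: str) -> bool:
    """Stand-in for a learned input/output classifier."""
    return any(phrase in text.lower() for phrase in BLOCKLIST)

def guarded_generate(generate, prompt: str) -> str:
    """Screen the user input, generate, then screen the model output before returning it."""
    if flagged(prompt):
        return "Request declined by the input classifier."
    output = generate(prompt)
    if flagged(output):
        return "Response withheld by the output classifier."
    return output
```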
Constitutional Classifiers: Defending against universal jailbreaks
A paper from Anthropic describing a new way to guard LLMs against jailbreaking
https://www.anthropic.com/research/constitutional-classifiers
 
 

Backlinks

Defense Jailbreaking
Copyright Seonglae Cho