Self-improving without labels
SL (Supervised Learning) phase: the model critiques and revises its own responses according to constitutional principles, then is fine-tuned on the revisions

RL (Reinforcement Learning) phase (RLAIF): the model compares pairs of responses against the constitution, and these AI-generated preference labels train a preference model used as the reward for RL
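A minimal sketch of the two phases, assuming a hypothetical `generate(prompt)` language-model call (stubbed here so the control flow runs standalone; prompts and the constitution are illustrative, not the paper's exact wording):

```python
def generate(prompt: str) -> str:
    # Hypothetical stand-in for a language-model call; stubbed for illustration.
    return f"response to: {prompt[:40]}"

# Illustrative constitution: short natural-language principles.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def sl_phase(prompt: str, principle: str) -> str:
    """SL phase: sample a response, self-critique it against a principle,
    then revise. The (prompt, revision) pairs become fine-tuning targets."""
    response = generate(prompt)
    critique = generate(f"Critique per principle '{principle}': {response}")
    revision = generate(f"Revise given critique '{critique}': {response}")
    return revision

def rl_phase(prompt: str, principle: str) -> int:
    """RL phase (RLAIF): the model itself labels which of two sampled
    responses better follows the constitution; these AI preference labels
    train a preference model that serves as the RL reward signal."""
    a = generate(prompt)
    b = generate(prompt + " (alternative)")
    choice = generate(f"Which better follows '{principle}'? A: {a} B: {b}")
    return 0 if "A:" in choice else 1
```

Note that no human feedback labels for harms appear anywhere: the only human input is the constitution itself.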
Anthropic on Twitter / X:
"In our paper, we describe how we've used Constitutional AI to train better and more harmless AI assistants without any human feedback labels for harms. This approach leads to models that are safer and also more helpful." — Anthropic (@AnthropicAI), December 16, 2022
https://twitter.com/AnthropicAI/status/1603791168495489030
Constitutional AI: Harmlessness from AI Feedback (arXiv)
"We show that language models can learn to follow a set of simple, natural language principles via self-improvement, and we use this new method to train a more harmless assistant."
https://arxiv.org/pdf/2212.08073.pdf
https://www.anthropic.com/index/constitutional-ai-harmlessness-from-ai-feedback

Constitutional Classifiers from Anthropic build on Constitutional AI
Classifiers trained on synthetic data generated from a constitution of natural-language rules
Constitutional Classifiers: Defending against universal jailbreaks
A paper from Anthropic describing a new way to guard LLMs against jailbreaking
https://www.anthropic.com/research/constitutional-classifiers


Seonglae Cho