Self-improvement without human labels
SL (Supervised Learning) phase: the model drafts a response, critiques it against the constitutional principles, revises it, and is then fine-tuned on the revised responses.

RL (Reinforcement Learning) phase (RLAIF): the model compares pairs of responses against the constitution to produce AI preference labels, which train a preference model used as the reward signal for RL.
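The two phases above can be sketched as a control-flow skeleton. This is a minimal illustration, not Anthropic's implementation: `stub_model`, `sl_phase`, and `ai_preference` are hypothetical names, and a canned stub stands in for real LLM calls so the loop actually runs.

```python
# Sketch of Constitutional AI's two phases with a stub model.
# All names here are illustrative, not Anthropic's API.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def stub_model(prompt: str) -> str:
    # Placeholder for a real LLM call; returns canned text so the
    # pipeline below is executable.
    if "Critique" in prompt:
        return "The response could be more careful about harm."
    if "Revise" in prompt:
        return "Here is a safer, revised response."
    return "Initial draft response."

# SL phase: generate -> self-critique -> revise; the (prompt, revision)
# pairs then serve as supervised fine-tuning targets (tuning omitted).
def sl_phase(prompt: str, principles: list[str]) -> str:
    response = stub_model(prompt)
    for principle in principles:
        critique = stub_model(f"Critique per '{principle}': {response}")
        response = stub_model(f"Revise given '{critique}': {response}")
    return response

# RL phase (RLAIF): the model itself labels which of two responses
# better follows a principle; these AI preference labels train a
# preference model used as the RL reward. Stubbed as a length check.
def ai_preference(prompt: str, a: str, b: str, principle: str) -> int:
    return 0 if len(a) >= len(b) else 1  # 0 -> prefer a, 1 -> prefer b

draft = stub_model("User question")
revision = sl_phase("User question", CONSTITUTION)
label = ai_preference("User question", revision, draft, CONSTITUTION[0])
```

The key property the sketch preserves is that no human label appears anywhere: both the revision targets and the preference labels come from the model itself, steered only by the written constitution.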
Anthropic on Twitter / X
In our paper, we describe how we’ve used Constitutional AI to train better and more harmless AI assistants without any human feedback labels for harms. This approach leads to models that are safer and also more helpful. (Anthropic, @AnthropicAI, December 16, 2022)
https://twitter.com/AnthropicAI/status/1603791168495489030
arxiv.org
https://arxiv.org/pdf/2212.08073.pdf
Constitutional AI: Harmlessness from AI Feedback
We show that language models can learn to follow a set of simple, natural language principles via self-improvement, and we use this new method to train a more harmless assistant.
https://www.anthropic.com/index/constitutional-ai-harmlessness-from-ai-feedback

Constitutional Classifiers from Anthropic, building on Constitutional AI
Heuristic rules
Constitutional Classifiers: Defending against universal jailbreaks
A paper from Anthropic describing a new way to guard LLMs against jailbreaking
https://www.anthropic.com/research/constitutional-classifiers

Constitutional AI applied to character/persona training
Context Awareness: Constitutional AI can mitigate Emergent Misalignment — LessWrong
We investigate whether Constitutional AI-style character training can increase robustness to Emergent Misalignment (EM). We take 11 character-trained…
https://www.lesswrong.com/posts/yA2hquLrFFSFDtcoE/context-awareness-constitutional-ai-can-mitigate-emergent

Seonglae Cho