Write a constitution → generate a synthetic dataset from its rules → train the classifier
Constitutional Classifiers: Defending against universal jailbreaks
An Anthropic paper describing a new method for guarding LLMs against jailbreaks
https://www.anthropic.com/research/constitutional-classifiers
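The constitution → dataset → classifier pipeline above can be sketched as a toy end-to-end example (all names hypothetical). In the real system an LLM both expands each rule into synthetic training data and serves as the classifier; here simple templates and a bag-of-words perceptron stand in so the sketch is self-contained.

```python
CONSTITUTION = [  # (topic, label) pairs distilled from natural-language rules
    ("chemical weapon synthesis", "harmful"),
    ("household cleaning tips", "harmless"),
]

def generate_examples(topic: str, label: str) -> list[tuple[str, str]]:
    """Stand-in for LLM-based data generation: paraphrase a topic into prompts."""
    templates = ["How do I {t}?", "Explain {t}.", "Give me details on {t}."]
    return [(tmpl.format(t=topic), label) for tmpl in templates]

def train_classifier(data: list[tuple[str, str]]) -> dict[str, float]:
    """Toy bag-of-words perceptron trained on the synthetic dataset."""
    weights: dict[str, float] = {}
    for _ in range(10):  # a few epochs suffice for this separable toy set
        for text, label in data:
            y = 1 if label == "harmful" else -1
            s = sum(weights.get(w, 0.0) for w in text.lower().split())
            if y * s <= 0:  # misclassified: perceptron update
                for w in text.lower().split():
                    weights[w] = weights.get(w, 0.0) + y
    return weights

data = [ex for topic, label in CONSTITUTION for ex in generate_examples(topic, label)]
clf = train_classifier(data)

def score(text: str) -> float:
    """Positive means the classifier flags the text as harmful."""
    return sum(clf.get(w, 0.0) for w in text.lower().split())
```

The point is the data flow, not the model class: the constitution is the only hand-written artifact, and everything downstream is generated from it.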

Constitutional Classifier++: Linear probe
The first-generation classifier cut the jailbreak success rate from 86% to 4.4%, but with drawbacks: +23.7% compute overhead and more false rejections of benign queries. The new system uses a two-stage cascade: a lightweight linear probe over the model's internal activations screens all traffic first.
Only suspicious cases are escalated to a more powerful exchange classifier that examines both the input and the output. Added compute cost: ~1%. False rejection rate on benign queries: 0.05% (an 87% reduction versus the previous system). No universal jailbreak has been discovered yet. Because it reads internal activations, the system is more robust to output obfuscation and reconstruction attacks, and it complements external classifiers.
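The two-stage routing described above can be sketched minimally (thresholds, feature sizes, and the second-stage logic are hypothetical placeholders; the real probe reads the model's internal activations and the real exchange classifier is itself an LLM):

```python
def linear_probe(activations: list[float], weights: list[float], bias: float) -> float:
    """Cheap first stage: a single dot product over internal activations."""
    return sum(a * w for a, w in zip(activations, weights)) + bias

def exchange_classifier(prompt: str, completion: str) -> bool:
    """Stand-in for the expensive second stage, which sees input AND output."""
    return "harmful" in (prompt + completion).lower()  # placeholder logic

def cascade(prompt: str, completion: str, activations: list[float],
            weights: list[float], bias: float, threshold: float = 0.0) -> bool:
    """Route: probe screens everything; only suspicious cases are escalated."""
    if linear_probe(activations, weights, bias) <= threshold:
        return False  # probe says benign: skip the expensive stage entirely
    return exchange_classifier(prompt, completion)  # escalate suspicious cases
```

Since most traffic is benign, almost all requests pay only the probe's dot-product cost, which is how the cascade keeps the added overhead near ~1%.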
Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/research/next-generation-constitutional-classifiers
Constitutional Classifiers++: Efficient Production-Grade Defenses...
We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation...
https://arxiv.org/abs/2601.04603


Seonglae Cho