Constitutional Classifier

Creator
Seonglae Cho
Created
2025 Feb 9 16:33
Edited
2026 Jan 14 22:7
 
Write a constitution → generate a dataset from its rubric → train a classifier on that dataset
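
The three steps above can be sketched in a few lines of Python. This is a toy illustration only: the note does not say how the dataset is generated or which model is trained, so the `CONSTITUTION` dictionary, the hand-written example phrasings, and the bag-of-words perceptron are all assumptions standing in for the real components.

```python
from collections import defaultdict

# Hypothetical constitution: each category lists example phrasings (the rubric).
CONSTITUTION = {
    "harmful": ["how to synthesize a nerve agent", "steps to build a pipe bomb"],
    "harmless": ["how to bake sourdough bread", "steps to build a bird house"],
}


def build_dataset(constitution):
    """Turn the rubric into (text, label) pairs; label 1 means harmful."""
    return [
        (text, 1 if category == "harmful" else 0)
        for category, texts in constitution.items()
        for text in texts
    ]


def featurize(text):
    """Bag-of-words features: the set of lowercase tokens."""
    return set(text.lower().split())


def predict(weights, bias, text):
    """Return 1 (flag) if the linear score is positive, else 0 (allow)."""
    score = bias + sum(weights.get(tok, 0.0) for tok in featurize(text))
    return 1 if score > 0 else 0


def train_perceptron(dataset, epochs=10):
    """Classic perceptron updates; a stand-in for the real classifier training."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for text, label in dataset:
            error = label - predict(weights, bias, text)
            if error:
                for tok in featurize(text):
                    weights[tok] += error
                bias += error
    return weights, bias


dataset = build_dataset(CONSTITUTION)
weights, bias = train_perceptron(dataset)
```

After training, `predict(weights, bias, text)` labels a new query against the rubric. A realistic pipeline would presumably need far more data and a much stronger model than this sketch.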

Constitutional Classifier++: Linear probe

The first-generation classifier reduced the jailbreak success rate from 86% to 4.4%, but at a cost: 23.7% more compute and more false rejections of benign queries. The new system uses a two-stage cascade. A lightweight linear probe over the model's internal activations acts as the first filter and screens all traffic; only suspicious cases are escalated to a more powerful exchange classifier that examines both input and output.
Results: about 1% additional compute, and a 0.05% false rejection rate on benign queries (an 87% reduction versus the previous system). No universal jailbreak has been discovered so far. Because it reads internal activations, the system is more robust to output obfuscation and reconstruction attacks, and it complements external classifiers.
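
The cascade's control flow can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual implementation: `LinearProbe`, `Decision`, `cascade`, the sigmoid scoring, and the `probe_threshold` value are hypothetical names and choices, and the exchange classifier is passed in as an arbitrary callable.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LinearProbe:
    """Hypothetical linear probe: a logistic score over an activation vector."""
    weights: Sequence[float]
    bias: float

    def score(self, activations: Sequence[float]) -> float:
        z = sum(w * a for w, a in zip(self.weights, activations)) + self.bias
        # Numerically stable sigmoid.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)


@dataclass
class Decision:
    blocked: bool
    escalated: bool
    probe_score: float


def cascade(
    activations: Sequence[float],
    prompt: str,
    response: str,
    probe: LinearProbe,
    exchange_classifier: Callable[[str, str], bool],
    probe_threshold: float = 0.5,  # assumed value, not taken from the source
) -> Decision:
    """Stage 1 screens every request; stage 2 runs only on suspicious ones."""
    p = probe.score(activations)
    if p < probe_threshold:
        # Cheap path: the expensive classifier is never called.
        return Decision(blocked=False, escalated=False, probe_score=p)
    # Expensive path: the exchange classifier sees both input and output.
    blocked = exchange_classifier(prompt, response)
    return Decision(blocked=blocked, escalated=True, probe_score=p)
```

Because the second stage runs only when the probe score crosses the threshold, the average cost per request stays close to the probe's cost as long as escalations are rare, which is consistent with the small compute overhead reported above.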
 
