Internal Probes
Structure (aggregation) determines performance → MultiMax/Rolling/AlphaEvolve
Existing linear/EMA/mean probes have limitations in practice → they should be replaced with new aggregation schemes or architectures such as MultiMax / Rolling Attn / AlphaEvolve. However, even these can barely defend against adaptive attacks.
- Best probe test error ≈ 2.5%
- Jailbreak success rate (FNR) always remains at 1~3% or higher
- More vulnerable to ART/Adaptive attacks
Call the LLM only for ambiguous cases → reduces cost to about 1/50 while maintaining performance
- MultiMax → max pooling
- Rolling Attn → local window + attn
- AlphaEvolve → automatic structure search
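
To make the difference between these aggregation choices concrete, here is a minimal sketch in plain Python. It assumes a probe has already produced one score per token; the function names, the `alpha` and `window` parameters, and the toy data are all hypothetical and are not taken from the paper. The max pooling function is only a single-head stand-in for the MultiMax idea, and the rolling version takes its attention logits as an input rather than learning them.

```python
import math

def mean_pool(scores):
    """Average of per-token scores; a short harmful span gets diluted."""
    return sum(scores) / len(scores)

def ema_pool(scores, alpha=0.5):
    """Exponential moving average over tokens; returns the final value."""
    ema = scores[0]
    for s in scores[1:]:
        ema = alpha * s + (1 - alpha) * ema
    return ema

def max_pool(scores):
    """Max over tokens (simplified stand-in for MultiMax-style pooling)."""
    return max(scores)

def rolling_attention_pool(values, logits, window=4):
    """Softmax-weighted average inside each local window, then max over windows."""
    n = len(values)
    best = -math.inf
    for start in range(max(1, n - window + 1)):
        v = values[start:start + window]
        l = logits[start:start + window]
        m = max(l)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - m) for x in l]
        total = sum(weights)
        pooled = sum(w * x for w, x in zip(weights, v)) / total
        best = max(best, pooled)
    return best

# Toy input: a long benign context with a two-token harmful span in the middle.
scores = [0.0] * 20 + [1.0, 1.0] + [0.0] * 20
uniform_logits = [0.0] * len(scores)

print("mean   ", mean_pool(scores))
print("ema    ", ema_pool(scores))
print("max    ", max_pool(scores))
print("rolling", rolling_attention_pool(scores, uniform_logits))
```

On this toy input the mean and the final EMA value are both pulled toward zero by the surrounding benign tokens, while the max and the windowed variant still react to the short span. That is the intuition behind "structure (aggregation) determines performance", shown here under the stated assumptions only.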
https://arxiv.org/pdf/2601.11516
Constitutional Classifier++: Linear probe
Holds up relatively well even against adaptive attacks, unlike the Gemini probe
The first-generation classifier reduced jailbreak success rate from 86% → 4.4%, but introduced drawbacks: +23.7% computational cost and increased false rejection of benign queries. The new system uses a two-stage cascade architecture: a lightweight linear probe based on internal activations acts as the first filter, screening all traffic.
Only suspicious cases are escalated to a more powerful exchange classifier that examines both input and output. Additional computational cost: ~1%. False rejection rate for benign queries: 0.05% (87% reduction vs. previous). No universal jailbreaks discovered yet. By leveraging internal activations, the system is more robust against output obfuscation and reconstruction attacks, and works complementarily with external classifiers.
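
The two-stage cascade described above can be sketched roughly as follows. This is an illustration under assumptions, not Anthropic's implementation: the `cascade` function, the `CascadeResult` type, the `escalate_threshold` value, and the toy weights are all hypothetical, and the second-stage classifier is passed in as an arbitrary callable.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class CascadeResult:
    blocked: bool
    escalated: bool
    probe_score: float

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def linear_probe_score(activation: Sequence[float],
                       weights: Sequence[float],
                       bias: float) -> float:
    """Stage 1: a linear probe on an internal activation vector."""
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return sigmoid(logit)

def cascade(activation: Sequence[float],
            exchange: str,
            weights: Sequence[float],
            bias: float,
            exchange_classifier: Callable[[str], bool],
            escalate_threshold: float = 0.1) -> CascadeResult:
    """Screen every request with the cheap probe; escalate only suspicious ones."""
    score = linear_probe_score(activation, weights, bias)
    if score < escalate_threshold:
        # Clearly benign according to the probe: skip the expensive classifier.
        return CascadeResult(blocked=False, escalated=False, probe_score=score)
    # Stage 2: the stronger classifier sees the full input/output exchange.
    blocked = exchange_classifier(exchange)
    return CascadeResult(blocked=blocked, escalated=True, probe_score=score)

# Hypothetical toy setup: 2-dimensional activations and a placeholder classifier.
toy_weights, toy_bias = [1.0, -1.0], -2.0
placeholder_classifier = lambda text: "forbidden" in text

print(cascade([0.0, 1.0], "benign exchange", toy_weights, toy_bias,
              placeholder_classifier))
print(cascade([3.0, 0.0], "forbidden exchange", toy_weights, toy_bias,
              placeholder_classifier))
```

The design point is that the expensive classifier runs only on the fraction of traffic the probe flags, so the added compute scales with the escalation rate rather than with total traffic.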
Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
https://www.anthropic.com/research/next-generation-constitutional-classifiers
Constitutional Classifiers++: Efficient Production-Grade Defenses...
We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation...
https://arxiv.org/abs/2601.04603


Seonglae Cho