Internal Probe

Creator
Seonglae Cho
Created
2026 Feb 13 12:49
Edited
2026 Feb 13 12:51
Refs
Internal Probes

Structure (aggregation) determines performance → MultiMax/Rolling/AlphaEvolve

Existing linear, EMA, and mean probes have practical limitations and should be replaced with new aggregation schemes or architectures such as MultiMax, Rolling Attn, and AlphaEvolve. Even the improved probes, however, can barely defend against adaptive attacks.
  • Best probe test error ≈ 2.5%
  • Jailbreak success rate (FNR) always stays at roughly 1% to 3% or higher
  • More vulnerable to ART/adaptive attacks
Call the LLM only when the probe is ambiguous → cost drops to about 1/50 while performance is maintained
  • MultiMax → max pooling
  • Rolling Attn → local window + attn
  • AlphaEvolve → automatic structure search
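The aggregation choices listed above can be compared on a toy example. This is a minimal sketch, not the architecture from the referenced work: it assumes each token already has a scalar probe score, reduces MultiMax to a plain max over those scores (the note describes it only as max pooling), and models Rolling Attn as softmax attention inside a sliding window followed by a max over windows. The function names, the `window` parameter, and the reuse of scores as attention logits are all hypothetical.

```python
import math


def mean_pool(scores):
    """Average per-token probe scores over the whole sequence."""
    return sum(scores) / len(scores)


def max_pool(scores):
    """Keep only the single most suspicious token score."""
    return max(scores)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rolling_attention_pool(scores, attn_logits, window=4):
    """Attend within each local window, then take the max over windows."""
    best = -math.inf
    for start in range(0, max(1, len(scores) - window + 1)):
        local_scores = scores[start:start + window]
        weights = softmax(attn_logits[start:start + window])
        pooled = sum(w * s for w, s in zip(weights, local_scores))
        best = max(best, pooled)
    return best


# Toy sequence: a short suspicious span buried in a long benign context.
scores = [0.0] * 20 + [4.0, 4.0] + [0.0] * 20
attn_logits = scores  # assumption: reuse the scores as attention logits

mean_score = mean_pool(scores)
max_score = max_pool(scores)
rolling_score = rolling_attention_pool(scores, attn_logits, window=4)
```

In this toy case the mean is diluted by the benign context, while the max and the windowed attention both stay close to the score of the suspicious span. That illustrates why the aggregation structure, rather than the per-token probe itself, can dominate performance on long inputs.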

Constitutional Classifiers++: Linear probe

Holds up relatively well even against adaptive attacks, unlike the Gemini Probe.
The first-generation classifier reduced jailbreak success rate from 86% → 4.4%, but introduced drawbacks: +23.7% computational cost and increased false rejection of benign queries. The new system uses a two-stage cascade architecture: a lightweight linear probe based on internal activations acts as the first filter, screening all traffic.
Only suspicious cases are escalated to a more powerful exchange classifier that examines both input and output. The additional computational cost is about 1%, and the false rejection rate for benign queries is 0.05% (an 87% reduction from the previous system). No universal jailbreaks have been discovered yet. Because it uses internal activations, the system is more robust against output obfuscation and reconstruction attacks, and it complements external classifiers.
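The two-stage cascade can be sketched as follows. This is an illustration under stated assumptions, not Anthropic's implementation: `cascade`, `expected_overhead`, `toy_exchange_classifier`, the `threshold` value, and the keyword rule are all hypothetical, and the probe score is assumed to be a precomputed scalar.

```python
def cascade(prompt, response, probe_score, exchange_classifier, threshold=0.5):
    """Two-stage filter: cheap probe on all traffic, costly classifier on flagged traffic.

    Returns a pair (blocked, escalated).
    """
    if probe_score < threshold:
        # The probe sees nothing suspicious: allow without the second stage.
        return False, False
    # The probe flagged the exchange: the stronger classifier decides.
    return exchange_classifier(prompt, response), True


def expected_overhead(escalation_rate, probe_cost, classifier_cost):
    """Expected extra cost per request, in arbitrary relative units."""
    return probe_cost + escalation_rate * classifier_cost


def toy_exchange_classifier(prompt, response):
    # Hypothetical stand-in for a real classifier over input and output.
    return "step-by-step synthesis" in response.lower()


benign = cascade("How do I bake bread?", "Mix flour and water.",
                 0.1, toy_exchange_classifier)
flagged_benign = cascade("Chemistry homework", "Balance the equation.",
                         0.9, toy_exchange_classifier)
flagged_harmful = cascade("Redacted request", "Here is a step-by-step synthesis route.",
                          0.9, toy_exchange_classifier)
```

The cost function shows why the cascade is cheap: the expensive classifier's cost is multiplied by the escalation rate, so overhead stays small as long as the probe escalates only a small fraction of traffic. A benign exchange that the probe flags by mistake is still allowed if the second stage clears it, which is one way the design can lower false rejections.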
Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
Constitutional Classifiers++: Efficient Production-Grade Defenses...
