Internal Probe

Internal Probes

Structure (aggregation) determines performance → MultiMax/Rolling/AlphaEvolve

Existing linear/EMA/mean probes have limitations in practice → they should be replaced with new aggregations/architectures such as MultiMax, Rolling Attn, or AlphaEvolve. However, even these can barely defend against adaptive attacks.

Best probe test error ≈ 2.5%

Jailbreak success rate (FNR) always remains at roughly 1-3% or higher

More vulnerable to ART/Adaptive attacks

Call LLM only when ambiguous → Reduce cost to 1/50 level while maintaining performance
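A minimal sketch of the "call the LLM only when ambiguous" idea, assuming the probe emits a score in [0, 1] and that two thresholds define the ambiguous band. The names `cascade`, `llm_judge`, `low`, and `high` are hypothetical and do not come from the paper. If the probe's own cost is negligible, escalating about 2% of traffic would correspond to the 1/50 figure; the actual escalation rate is not stated in these notes.

```python
from typing import Callable, List, Sequence, Tuple


def cascade(
    probe_scores: Sequence[float],
    llm_judge: Callable[[int], bool],
    low: float = 0.2,   # hypothetical: below this, the probe says "benign"
    high: float = 0.8,  # hypothetical: above this, the probe says "harmful"
) -> Tuple[List[bool], float]:
    """Return per-item harmful/benign decisions and the fraction sent to the LLM."""
    decisions: List[bool] = []
    llm_calls = 0
    for i, score in enumerate(probe_scores):
        if score < low:
            decisions.append(False)          # confident benign, no LLM call
        elif score > high:
            decisions.append(True)           # confident harmful, no LLM call
        else:
            decisions.append(llm_judge(i))   # ambiguous band: defer to the LLM
            llm_calls += 1
    fraction = llm_calls / len(probe_scores) if probe_scores else 0.0
    return decisions, fraction


# Usage with a stub judge that flags every ambiguous item as harmful.
decisions, llm_fraction = cascade([0.05, 0.95, 0.5, 0.1], llm_judge=lambda i: True)
```

Widening the band between `low` and `high` trades cost for accuracy: more items reach the LLM, and fewer decisions rest on the probe alone.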

MultiMax → max pooling

Rolling Attn → local window + attn

AlphaEvolve → automatic structure search
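To make the aggregation choices concrete, here is a rough sketch of how the same per-token probe scores can yield different sequence-level scores under mean, EMA, max, and windowed-attention pooling. This is an illustration under assumptions, not the paper's architecture: the function names, the `alpha` and `window` parameters, and the "softmax inside each window, then max over windows" reading of Rolling Attn are all hypothetical.

```python
import numpy as np


def mean_pool(scores: np.ndarray) -> float:
    """Average of all token scores."""
    return float(np.mean(scores))


def ema_pool(scores: np.ndarray, alpha: float = 0.5) -> float:
    """Final value of an exponential moving average over the token scores."""
    ema = float(scores[0])
    for s in scores[1:]:
        ema = alpha * float(s) + (1.0 - alpha) * ema
    return ema


def max_pool(scores: np.ndarray) -> float:
    """Largest token score."""
    return float(np.max(scores))


def rolling_attn_pool(scores: np.ndarray, attn_logits: np.ndarray, window: int = 4) -> float:
    """Softmax-weighted average inside each sliding window, then max over windows."""
    best = -np.inf
    n_windows = max(1, len(scores) - window + 1)
    for start in range(n_windows):
        s = scores[start:start + window]
        a = attn_logits[start:start + window]
        w = np.exp(a - a.max())   # numerically stable softmax weights
        w /= w.sum()
        best = max(best, float(np.dot(w, s)))
    return best


# One high-scoring token in an otherwise low-scoring sequence.
scores = np.array([0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1])
uniform_logits = np.zeros_like(scores)  # uniform attention, for illustration only
pooled = {
    "mean": mean_pool(scores),
    "ema": ema_pool(scores),
    "max": max_pool(scores),
    "rolling_attn": rolling_attn_pool(scores, uniform_logits),
}
```

In this toy input the single high-scoring token is diluted by the mean and has largely decayed out of the EMA by the end of the sequence, while max pooling preserves it and the windowed variant sits in between.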

https://arxiv.org/pdf/2601.11516

Constitutional Classifiers++: Linear probe

Holds up relatively well even against adaptive attacks, unlike the Gemini Probe.

The first-generation classifier reduced the jailbreak success rate from 86% to 4.4%, but it came with drawbacks: +23.7% computational cost and more false rejections of benign queries. The new system uses a two-stage cascade architecture in which a lightweight linear probe on internal activations acts as the first filter and screens all traffic.

Only suspicious cases are escalated to a more powerful exchange classifier that examines both input and output. Additional computational cost: ~1%. False rejection rate for benign queries: 0.05% (87% reduction vs. previous). No universal jailbreaks discovered yet. By leveraging internal activations, the system is more robust against output obfuscation and reconstruction attacks, and works complementarily with external classifiers.
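The first stage described above can be sketched as a logistic linear probe over activation vectors that produces an escalation mask. This is a minimal sketch under assumptions: the weights `w`, bias `b`, the threshold, and the idea of one pooled activation vector per exchange are hypothetical placeholders, and the second-stage exchange classifier is not modeled here.

```python
import numpy as np


def linear_probe_scores(activations: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Sigmoid of a linear readout; activations has shape (n_exchanges, d_model)."""
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))


def escalation_mask(
    activations: np.ndarray,
    w: np.ndarray,
    b: float,
    threshold: float = 0.5,  # hypothetical operating point
) -> np.ndarray:
    """True where the probe finds the exchange suspicious enough to escalate."""
    return linear_probe_scores(activations, w, b) >= threshold


# Toy example: three exchanges with 2-dimensional "activations".
acts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
w = np.array([2.0, -2.0])  # placeholder probe weights, not learned from real data
mask = escalation_mask(acts, w, b=0.0)
```

Only the rows where `mask` is true would be passed to the more expensive second stage, which is why the added compute can stay small when most traffic is benign.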

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

https://www.anthropic.com/research/next-generation-constitutional-classifiers

Constitutional Classifiers++: Efficient Production-Grade Defenses...

We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation...

https://arxiv.org/abs/2601.04603

Recommendations