Structure (aggregation) determines performance → MultiMax/Rolling/AlphaEvolve
Existing linear/EMA/mean probes have practical limitations → they should be replaced with new aggregations/architectures such as MultiMax, Rolling Attn, and AlphaEvolve. However, even these can barely defend against adaptive attacks.
- Best probe test error ≈ 2.5%
- Jailbreak success rate (FNR) always remains at 1-3% or higher
- More vulnerable to ART/Adaptive attacks
Call the LLM only when the probe is ambiguous → cost drops to roughly 1/50 while maintaining performance
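A minimal sketch of this deferral idea, assuming the probe outputs a score in [0, 1] and that "ambiguous" means the score falls between two thresholds. The names `cascade_classify`, `fake_probe`, `fake_llm_judge` and the thresholds 0.2 / 0.8 are hypothetical and not taken from the paper; the 1/50 cost figure is not reproduced here.

```python
from typing import Callable, Tuple


def cascade_classify(
    text: str,
    probe: Callable[[str], float],
    llm_judge: Callable[[str], bool],
    low: float = 0.2,   # assumed threshold: at or below this, treat as benign
    high: float = 0.8,  # assumed threshold: at or above this, treat as harmful
) -> Tuple[bool, bool]:
    """Return (flagged, used_llm) for one input."""
    score = probe(text)
    if score <= low:
        return False, False       # confident benign, no LLM call
    if score >= high:
        return True, False        # confident harmful, no LLM call
    return llm_judge(text), True  # ambiguous band: defer to the LLM


# Hypothetical stand-ins for a real probe and a real LLM classifier.
FAKE_SCORES = {"benign": 0.05, "borderline": 0.5, "harmful": 0.95}


def fake_probe(text: str) -> float:
    return FAKE_SCORES[text]


def fake_llm_judge(text: str) -> bool:
    return True  # pretend the LLM flags every input it is asked about


inputs = ["benign"] * 8 + ["borderline"] + ["harmful"]
results = [cascade_classify(t, fake_probe, fake_llm_judge) for t in inputs]
llm_calls = sum(used_llm for _, used_llm in results)
print(f"LLM calls: {llm_calls} of {len(inputs)} inputs")
```

The cost saving depends entirely on how often scores land in the ambiguous band, so the thresholds would have to be tuned on held-out data.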
- MultiMax → max pooling
- Rolling Attn → local window + attn
- AlphaEvolve → automatic structure search
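Why aggregation matters can be sketched on per-token probe scores. This is an illustration only, not the paper's definitions: `rolling_attn_pool` uses the scores themselves as attention logits (a real Rolling Attn probe would presumably learn its weights from activations), and the window size, EMA rate, and toy scores are all assumed values.

```python
import math
from typing import List


def mean_pool(scores: List[float]) -> float:
    """Average over all tokens (assumes a non-empty list)."""
    return sum(scores) / len(scores)


def ema_pool(scores: List[float], alpha: float = 0.5) -> float:
    """Final value of an exponential moving average over the tokens."""
    ema = scores[0]
    for s in scores[1:]:
        ema = alpha * s + (1 - alpha) * ema
    return ema


def max_pool(scores: List[float]) -> float:
    """Max pooling: the single most suspicious token decides."""
    return max(scores)


def rolling_attn_pool(scores: List[float], window: int = 3) -> float:
    """Softmax-weighted average inside each local window, then max over windows."""
    best = -math.inf
    for start in range(max(1, len(scores) - window + 1)):
        chunk = scores[start:start + window]
        peak = max(chunk)  # subtract the peak for numerical stability
        weights = [math.exp(s - peak) for s in chunk]
        pooled = sum(w * s for w, s in zip(weights, chunk)) / sum(weights)
        best = max(best, pooled)
    return best


# Toy input: a short harmful span buried in a long benign context.
scores = [0.0] * 20 + [4.0, 5.0] + [0.0] * 20

print("mean        :", round(mean_pool(scores), 3))
print("ema (final) :", round(ema_pool(scores), 6))
print("max         :", max_pool(scores))
print("rolling attn:", round(rolling_attn_pool(scores), 3))
```

On this toy input the mean and the final EMA value are diluted by the benign tokens, while the max-based and windowed variants keep the signal from the short span.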
https://arxiv.org/pdf/2601.11516

Seonglae Cho