In long contexts, mean pooling can miss cases where malicious tokens appear only briefly, because a short burst of high scores is averaged away by the many benign tokens around it. The idea is that when a recent malicious signal spikes, the EMA rises, and taking the max captures that spike. First, train a standard linear probe with mean pooling. Then, at inference, accumulate the per-token probe scores with an exponential moving average (EMA) and use the maximum EMA value over the sequence as the final score.
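A minimal sketch of this inference-time aggregation, to make the contrast with mean pooling concrete. The function names, the smoothing factor `alpha`, and the choice to initialize the EMA with the first token's score are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def per_token_scores(activations, w, b=0.0):
    """Score each token with an already-trained linear probe.

    activations: (num_tokens, hidden_dim) array; w: (hidden_dim,) probe weights.
    Names and shapes are hypothetical.
    """
    return np.asarray(activations, dtype=float) @ np.asarray(w, dtype=float) + b


def mean_score(token_scores):
    """Baseline: mean pooling over all per-token scores."""
    return float(np.mean(token_scores))


def ema_max_score(token_scores, alpha=0.5):
    """Run an EMA over per-token scores and return the maximum EMA value.

    ema_t = alpha * s_t + (1 - alpha) * ema_{t-1}
    Assumption: the EMA starts at the first token's score.
    """
    scores = np.asarray(token_scores, dtype=float)
    if scores.size == 0:
        raise ValueError("token_scores must not be empty")
    ema = scores[0]
    best = ema
    for s in scores[1:]:
        ema = alpha * s + (1.0 - alpha) * ema
        best = max(best, ema)  # keep the peak reached anywhere in the sequence
    return float(best)


# Toy example: 1000 benign tokens (score 0) with a 5-token malicious burst (score 1).
scores = np.zeros(1000)
scores[500:505] = 1.0

print("mean pooling:", mean_score(scores))
print("EMA + max:   ", ema_max_score(scores, alpha=0.5))
```

In this toy sequence the mean stays near zero (about 0.005) because the burst is diluted by benign tokens, while the EMA peak climbs to about 0.97 during the burst. A smaller `alpha` smooths more and reacts more slowly, so it is a tunable trade-off rather than a fixed value.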
https://arxiv.org/pdf/2601.11516

Seonglae Cho