In long contexts, mean pooling dilutes cases where malicious tokens appear only briefly somewhere in the sequence. The idea behind the EMA probe is that when a recent malicious signal spikes, the EMA rises, and taking the max captures that peak. First, train a standard linear mean-pooling probe. Then, at inference, accumulate its per-token scores with an exponential moving average (EMA) and use the maximum EMA value over the sequence as the final score.
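The inference step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `ema_max_score` and `mean_score`, the smoothing factor `alpha`, and the choice to initialize the EMA with the first token's score are all assumptions made for the example. The per-token scores are assumed to come from an already trained linear probe.

```python
from typing import Sequence


def mean_score(token_scores: Sequence[float]) -> float:
    """Baseline: average the per-token probe scores over the whole sequence."""
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    return sum(token_scores) / len(token_scores)


def ema_max_score(token_scores: Sequence[float], alpha: float = 0.5) -> float:
    """Smooth per-token probe scores with an EMA and return the running maximum.

    alpha is a hypothetical smoothing factor in (0, 1]; larger values react
    faster to recent tokens. The EMA is seeded with the first score, which is
    an assumption of this sketch.
    """
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    ema = token_scores[0]
    best = ema
    for score in token_scores[1:]:
        # Standard EMA update: blend the new score with the running value.
        ema = alpha * score + (1.0 - alpha) * ema
        # Track the highest EMA value seen anywhere in the sequence.
        best = max(best, ema)
    return best


if __name__ == "__main__":
    # 100 tokens, benign (score 0.0) except for a short 4-token burst.
    scores = [0.0] * 50 + [4.0] * 4 + [0.0] * 46
    print("mean pooling:", mean_score(scores))
    print("EMA + max:   ", ema_max_score(scores, alpha=0.5))
```

In this toy sequence the four-token burst barely moves the mean, while the EMA climbs during the burst and the max retains that peak even after the scores fall back to zero.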
EMA Probe
Creator: Seonglae Cho
Created: 2026 Feb 10 18:0
Editor: Seonglae Cho
Edited: 2026 Feb 13 12:51
Refs
