EMA Probe

Creator

Seonglae Cho

Created

2026 Feb 10 18:0

Editor

Seonglae Cho

Edited

2026 Mar 13 18:36

Refs

In long contexts, mean pooling dilutes cases where malicious tokens appear only briefly somewhere in the sequence. The idea is that when a recent malicious signal spikes, the EMA rises, and we capture its maximum. First, train a standard linear probe with mean pooling. Then, at inference, accumulate the probe's per-token scores with an exponential moving average (EMA) and use the maximum EMA value over the sequence as the final score.
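A minimal sketch of the inference step described above, in NumPy. The function names, the smoothing factor `alpha`, and initializing the EMA at 0 are illustrative assumptions, not details taken from the paper. Because the probe is linear, averaging per-token scores gives the same result as scoring the mean-pooled activation, so the same weights can be reused per token.

```python
import numpy as np


def token_scores(hidden, w, b=0.0):
    """Per-token scores from a linear probe: hidden is (T, d), w is (d,)."""
    return np.asarray(hidden) @ np.asarray(w) + b


def mean_score(scores):
    """Baseline: mean pooling over all per-token scores."""
    return float(np.mean(scores))


def ema_max_score(scores, alpha=0.5):
    """Run an EMA over per-token scores and return its maximum value.

    alpha and the zero initial state are assumptions made for this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0:
        raise ValueError("scores must be non-empty")
    ema = 0.0
    best = -np.inf
    for s in scores:
        ema = alpha * s + (1.0 - alpha) * ema  # recent tokens weigh more
        best = max(best, ema)                  # remember the highest EMA seen
    return float(best)


# Toy example: 100 tokens, with a short burst of high scores in the middle.
scores = np.zeros(100)
scores[50:54] = 4.0
print(mean_score(scores))          # diluted by the 96 benign tokens
print(ema_max_score(scores, 0.5))  # stays close to the burst height
```

In this toy case the mean is pulled toward zero by the long benign stretch, while the EMA maximum stays near the burst value. A smaller `alpha` smooths more and requires a longer burst before the EMA rises.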


https://arxiv.org/pdf/2601.11516
