Mechanistic Probing
Mechanistic Auditing Methods
Full averaging (Not SAE, Residual Stream)
There are several design choices to consider:
- After/Before regression score
- Max/Mean pooling across tokens
For after-regression context, averaging probe scores across a response yielded the best results.
arxiv.org
https://arxiv.org/pdf/2507.12691
Prompts vs activation monitoring (OpenAI)
With small data, suffix-only prompted last-token probing is the most data-efficient, while with large data: SAE max-pooled probing outperforms raw probing and shows similar performance to prompted probing. SAE proving outperformed raw activation proving
arxiv.org
https://arxiv.org/pdf/2504.20271

Seonglae Cho