Mechanistic Probing
Mechanistic Auditing Methods
Full averaging (Not SAE, Residual Stream)
There are several design choices to consider:
- After/Before regression score
- Max/Mean pooling across tokens
For after-regression context, averaging probe scores across a response yielded the best results.
Prompts vs activation monitoring (OpenAI)
With small data, suffix-only prompted last-token probing is the most data-efficient, while with large data: SAE max-pooled probing outperforms raw probing and shows similar performance to prompted probing. SAE proving outperformed raw activation proving