Mechanistic AI Auditing

Mechanistic Probing

Mechanistic Auditing Methods

SAE Probing

Full averaging (Not SAE, Residual Stream)

There are several design choices to consider:

After/Before regression score

Max/Mean pooling across tokens

For after-regression context, averaging probe scores across a response yielded the best results.

arxiv.org

https://arxiv.org/pdf/2507.12691

Prompts vs activation monitoring (OpenAI)

With small data, suffix-only prompted last-token probing is the most data-efficient, while with large data: SAE max-pooled probing outperforms raw probing and shows similar performance to prompted probing. SAE proving outperformed raw activation proving

arxiv.org

https://arxiv.org/pdf/2504.20271

Mechanistic AI Auditing

Mechanistic Probing

Full averaging (Not SAE, Residual Stream)

Prompts vs activation monitoring (OpenAI)

Backlinks

Recommendations