Mechanistic AI Auditing

Creator
Seonglae Cho
Created
2025 Jul 23 9:49
Edited
2025 Jul 23 9:50

Mechanistic Probing

Mechanistic Auditing Methods

Full averaging (Not SAE, Residual Stream)

There are several design choices to consider:
  • Pooling after or before the regression (probe) score is computed
  • Max or mean pooling across tokens
When pooling after the regression, averaging the per-token probe scores across a response yielded the best results.
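The two design choices above can be sketched as follows. This is a minimal illustration only, assuming a linear probe with weights `w` and bias `b` applied to per-token residual-stream activations. The function names and toy numbers are hypothetical and do not come from the referenced work.

```python
from statistics import fmean

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def token_scores(acts, w, b=0.0):
    """Linear probe score for every token; acts is a list of per-token vectors."""
    return [dot(x, w) + b for x in acts]

def pool(values, how="mean"):
    """Pool a sequence of numbers with mean or max."""
    if how == "mean":
        return fmean(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown pooling: {how}")

def pool_after(acts, w, b=0.0, how="mean"):
    """Score each token first, then pool the scores across the response."""
    return pool(token_scores(acts, w, b), how)

def pool_before(acts, w, b=0.0, how="mean"):
    """Pool each activation dimension across tokens first, then score once."""
    pooled = [pool(column, how) for column in zip(*acts)]
    return dot(pooled, w) + b

# Toy response with 3 tokens and 2 activation dimensions (assumed values).
acts = [[0.2, -1.0], [1.5, 0.3], [-0.4, 0.8]]
w = [1.0, -0.5]

print(pool_after(acts, w, how="mean"), pool_before(acts, w, how="mean"))
print(pool_after(acts, w, how="max"), pool_before(acts, w, how="max"))
```

For a purely linear score, mean pooling before and after the probe gives the same value. The two orders differ once max pooling is used or a nonlinearity such as a sigmoid is applied to the score.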

Prompts vs activation monitoring (OpenAI)

With small data, suffix-only prompted last-token probing is the most data-efficient. With large data, SAE max-pooled probing outperforms raw-activation probing and performs similarly to prompted probing.
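To make the compared probe inputs concrete, here is a rough sketch of how the two feature vectors might be built before a classifier is fit on them. It assumes a simple ReLU SAE encoder with hypothetical weights `W_enc` (one row per feature) and bias `b_enc`. The source does not specify the SAE architecture or the probe, so all names and values are illustrative.

```python
def relu_sae_encode(x, W_enc, b_enc):
    """Encode one activation vector into SAE features (assumed ReLU encoder)."""
    return [
        max(0.0, sum(xi * wi for xi, wi in zip(x, row)) + bias)
        for row, bias in zip(W_enc, b_enc)
    ]

def sae_max_pooled(acts, W_enc, b_enc):
    """Max of each SAE feature over all tokens in the response."""
    feats = [relu_sae_encode(x, W_enc, b_enc) for x in acts]
    return [max(column) for column in zip(*feats)]

def last_token(acts):
    """Raw activation at the final token, e.g. the end of an appended suffix prompt."""
    return acts[-1]

# Toy data: 3 tokens, 2 activation dimensions, 3 SAE features (assumed values).
acts = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, -0.5, 0.0]

print(sae_max_pooled(acts, W_enc, b_enc))
print(last_token(acts))
```

Either vector would then be passed to a downstream probe such as a logistic regression. Max pooling keeps a feature's strongest activation anywhere in the response, whereas last-token probing relies on the model having summarized the relevant information at the final position.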
 
 

Recommendations