Pragmatic Interpretability

Creator
Seonglae Cho
Created
2025 Dec 2 1:40
Edited
2025 Dec 2 1:40


The traditional goal of completely reverse engineering a model has progressed very slowly. Rather than reverse engineering the entire structure, pragmatic interpretability shifts the focus to directly solving real-world safety problems.
Without feedback loops, self-deception becomes easy, so proxy tasks (measurable surrogate tasks) are essential. Even in sparse autoencoder (SAE) research, metrics like "reconstruction error" turned out to be nearly meaningless. Testing performance on proxies such as OOD generalization, unlearning, and hidden goal extraction instead revealed the real limitations clearly.
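The gap between reconstruction error and a proxy task can be illustrated with a toy sketch. Everything here is a hypothetical assumption made for illustration, not the SAE experiments referred to above: the synthetic two-dimensional "activations", the two hand-written reconstructions, and the sign-based proxy task are all invented. The point is only that the reconstruction with the lower error can still be useless on the task that matters.

```python
import random


def make_data(n, seed):
    """Toy 'activations': a large label-irrelevant coordinate and a small label-carrying one."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        big = rng.gauss(0.0, 10.0)  # high variance, unrelated to the label
        small = (1.0 if label else -1.0) + rng.gauss(0.0, 0.1)  # low variance, encodes the label
        data.append(((big, small), label))
    return data


def recon_keep_big(x):
    """Hypothetical reconstruction that keeps only the high-variance coordinate."""
    return (x[0], 0.0)


def recon_keep_small(x):
    """Hypothetical reconstruction that keeps only the label-carrying coordinate."""
    return (0.0, x[1])


def reconstruction_error(data, recon):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x, _ in data:
        r = recon(x)
        total += (x[0] - r[0]) ** 2 + (x[1] - r[1]) ** 2
    return total / len(data)


def proxy_accuracy(data, recon):
    """Proxy task: recover the label from the sign of the reconstructed second coordinate."""
    correct = 0
    for x, label in data:
        r = recon(x)
        pred = 1 if r[1] > 0 else 0
        correct += int(pred == label)
    return correct / len(data)


data = make_data(1000, seed=0)
for name, recon in [("keep_big", recon_keep_big), ("keep_small", recon_keep_small)]:
    err = reconstruction_error(data, recon)
    acc = proxy_accuracy(data, recon)
    print(f"{name}: reconstruction_error={err:.2f} proxy_accuracy={acc:.2f}")
```

In this toy setup `recon_keep_big` wins on reconstruction error but performs near chance on the proxy, while `recon_keep_small` reconstructs poorly yet solves the proxy almost perfectly. A real evaluation would replace the sign rule with an actual proxy such as OOD generalization or unlearning.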
 
 
 
