Automated Interpretability

Creator
Seonglae Cho
Created
2024 Apr 7 15:23
Edited
2025 Jul 1 15:49
Automated Interpretability Techniques

Interpretability Meta Metrics
 
Automated interpretability is a difficult task for models, and work on it has only just begun. Even so, it has already been very useful for quickly understanding dictionary learning features in a scalable fashion.

Rubric for human annotators (evaluating interpretability descriptions)

Natural language explanations of individual neurons are not the best approach to LLM explainability. The claim starts from the question of whether natural language can faithfully describe how individual neurons store and process information. Interventions on individual neurons showed no causal link between the explanations and changes in model output, which supports the claim.
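
To make the intervention idea concrete, here is a minimal sketch of an ablation check on a hand-built toy network. It is not the procedure used in the work summarized above. The network, its weights, and every function name (`forward`, `ablation_effect`, `supports_explanation`) are assumptions made up for illustration, as is the `min_drop` threshold. The sketch zeroes one hidden neuron and asks whether the output that an explanation says the neuron promotes actually drops.

```python
# Toy illustration only: a hand-built two-layer network, not a real LLM.
# All names and weights here are hypothetical.

def forward(x, w_in, w_out, ablate=None):
    """Return output logits; optionally zero one hidden neuron."""
    # Hidden layer with ReLU: hidden[j] = max(0, sum_i x[i] * w_in[i][j])
    hidden = [
        max(0.0, sum(xi * row[j] for xi, row in zip(x, w_in)))
        for j in range(len(w_in[0]))
    ]
    if ablate is not None:
        hidden[ablate] = 0.0  # the intervention: knock out one neuron
    # Output logits: logits[k] = sum_j hidden[j] * w_out[j][k]
    return [
        sum(h * row[k] for h, row in zip(hidden, w_out))
        for k in range(len(w_out[0]))
    ]


def ablation_effect(x, w_in, w_out, neuron):
    """Per-logit change caused by ablating `neuron` (ablated minus clean)."""
    clean = forward(x, w_in, w_out)
    ablated = forward(x, w_in, w_out, ablate=neuron)
    return [a - c for a, c in zip(ablated, clean)]


def supports_explanation(effect, promoted_output, min_drop=0.1):
    """True if ablation lowers the output the explanation says is promoted."""
    return effect[promoted_output] <= -min_drop


# 2 inputs -> 2 hidden neurons -> 2 outputs
w_in = [[1.0, 0.0],
        [0.0, 1.0]]
w_out = [[2.0, 0.0],   # neuron 0 writes only to output 0
         [0.0, 0.0]]   # neuron 1 fires but writes to nothing
x = [1.0, 1.0]

effect_0 = ablation_effect(x, w_in, w_out, neuron=0)
effect_1 = ablation_effect(x, w_in, w_out, neuron=1)

print(supports_explanation(effect_0, promoted_output=0))
print(supports_explanation(effect_1, promoted_output=0))
```

Both neurons are active on this input, so an explanation based only on activation patterns could describe either as "promotes output 0". Only ablating neuron 0 changes that output. The gap between a plausible description and a causal effect is what an intervention test is meant to expose.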
 
 
