Automated Interpretability

Creator
Seonglae Cho
Created
2024 Apr 7 15:23
Edited
2025 Feb 19 13:22
Automated Interpretability Techniques
 
 
 
Automated interpretability is a difficult task for models, and work on it has only just begun, but it has already been very useful for quickly understanding dictionary learning features in a scalable fashion.

Rubric for human annotator (evaluating interpretability description)

 
 

Recommendations