Automated Interpretability Techniques
Interpretability Meta Metrics
Although automated interpretability is a difficult task that we have only begun working on, it has already proven very useful for quickly understanding dictionary learning features in a scalable fashion.
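One common automated-interpretability loop scores a candidate feature description by how well it predicts where the feature actually fires. The sketch below is illustrative, not a specific implementation from this document: a "simulator" (in practice, a language model reading the description) guesses per-token activations, and the score is the correlation between simulated and observed activations.

```python
# Hypothetical sketch of "simulation scoring" for a feature description.
# An explainer proposes a description; a simulator predicts, per token, how
# strongly the feature should fire if the description were true. The score
# is the Pearson correlation between simulated and observed activations.

def simulation_score(observed, simulated):
    """Pearson correlation between observed and simulated activations."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_o == 0 or var_s == 0:
        return 0.0
    return cov / (var_o * var_s) ** 0.5

# Toy example: a feature described as firing on water-related words.
tokens    = ["the", "river", "flows", "into", "the", "ocean"]
observed  = [0.0, 0.9, 0.3, 0.0, 0.0, 0.8]  # true feature activations
simulated = [0.0, 1.0, 0.2, 0.0, 0.0, 1.0]  # simulator's guess from the description

score = simulation_score(observed, simulated)
print(round(score, 3))
```

A high score means the description predicts the feature's behavior well; a low score flags a description worth revising or discarding, which is what makes the loop scalable.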
Rubric for human annotator (evaluating interpretability description)
Natural language explanations of individual neurons may not be the best approach to LLM explainability. This claim begins with questions about whether natural language can faithfully describe how individual neurons store and process information. Moreover, interventions on individual neurons have shown no causal link between the explanations and changes in model output.
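The intervention test mentioned above can be made concrete with a toy example (all shapes, weights, and function names here are illustrative assumptions, not the actual experiments): ablate one neuron and measure how far the output distribution moves. If an explanation is faithful, ablating the neuron it describes should shift outputs on inputs the explanation covers.

```python
# Hypothetical sketch of a causal intervention test: zero out one neuron and
# measure the shift in the model's output distribution. A faithful neuron
# explanation predicts a large shift on relevant inputs; no shift suggests
# the explanation lacks a causal link to behavior. All values are toy data.

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(activations, weights):
    """Toy readout: each logit is a linear function of neuron activations."""
    return [sum(a * w for a, w in zip(activations, row)) for row in weights]

def ablation_effect(activations, weights, neuron_idx):
    """Total variation distance between output distributions with and
    without the given neuron (its activation set to zero)."""
    base = softmax(forward(activations, weights))
    ablated_acts = list(activations)
    ablated_acts[neuron_idx] = 0.0
    ablated = softmax(forward(ablated_acts, weights))
    return 0.5 * sum(abs(p - q) for p, q in zip(base, ablated))

# Toy setting: 3 neurons feeding 2 output logits.
weights = [[2.0, 0.0, 0.1],   # logit 0 reads mostly from neuron 0
           [0.0, 2.0, 0.1]]   # logit 1 reads mostly from neuron 1
activations = [1.0, 0.2, 0.5]

# Neuron 0 carries most of the signal here; neuron 2 barely matters.
print(ablation_effect(activations, weights, 0) > ablation_effect(activations, weights, 2))
```

Running many such ablations across a dataset, and comparing the measured effects against what each explanation predicts, is one way to test whether neuron-level descriptions are causally meaningful rather than merely correlational.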