Auto-Interp Score

Creator: Seonglae Cho
Created: 2025 Feb 19 13:22
Editor: Seonglae Cho
Edited: 2025 May 20 17:26
Refs:
LLM Neuron explainer
Automated Interpretability

After an LLM produces an explanation for a feature, an LLM is used again to score it: a simulator model predicts the feature's per-token activations from the explanation alone, and the score measures how well those predictions match the feature's actual activations.
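A minimal sketch of the scoring step, assuming the simulator's predictions are already available (the activation values below are hypothetical; a real pipeline would obtain `predicted` by querying an LLM simulator with the explanation):

```python
import numpy as np

def auto_interp_score(true_acts, simulated_acts):
    """Score an explanation as the Pearson correlation between the
    feature's actual activations and the activations an LLM simulator
    predicts from the explanation alone."""
    true_acts = np.asarray(true_acts, dtype=float)
    simulated_acts = np.asarray(simulated_acts, dtype=float)
    # Correlation is undefined for constant vectors; treat as zero score.
    if true_acts.std() == 0 or simulated_acts.std() == 0:
        return 0.0
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])

# Hypothetical per-token activations for one feature on a held-out text.
actual = [0.0, 0.9, 0.1, 0.8, 0.0]     # measured from the model
predicted = [0.1, 0.7, 0.0, 0.9, 0.1]  # simulated from the explanation
print(round(auto_interp_score(actual, predicted), 3))  # → 0.95
```

A correlation near 1 means the explanation lets the simulator reproduce the feature's behavior; a score near 0 means the explanation carries little predictive information about when the feature fires.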

Rubric:
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
"Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze."
https://transformer-circuits.pub/2023/monosemantic-features#appendix-rubric
Copyright Seonglae Cho