A brilliant and cheap method that puts a placeholder token X in the prompt and adds a steering vector (the SAE feature direction) to the language model's residual stream, so the model generates a self-explanation of the feature.
Because the quality of the explanations varies with the scale of the inserted vector, self-similarity and entropy metrics are combined to automatically search for the optimal scale.
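A minimal sketch of how such a pipeline could look, assuming a GPT-2-style HuggingFace model; this is not the authors' implementation. The prompt wording, insertion layer, candidate scales, entropy cutoff, the choice to overwrite (rather than add to) the residual at the placeholder position, and the random `feature_dir` standing in for a real SAE decoder direction are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Placeholder token X inside a question-style prompt (wording is an assumption).
prompt = 'What does "X" mean? Answer:'
inputs = tok(prompt, return_tensors="pt")
x_pos = max(i for i, t in enumerate(inputs.input_ids[0].tolist())
            if tok.decode([t]).strip() == "X")

layer = 6                                          # hypothetical insertion layer
feature_dir = torch.randn(model.config.n_embd)     # stand-in for an SAE decoder vector
feature_dir = feature_dir / feature_dir.norm()

def make_steer(scale):
    """Forward hook that overwrites the residual stream at X with the scaled feature."""
    def steer(module, hook_in, hook_out):
        hidden = hook_out[0] if isinstance(hook_out, tuple) else hook_out
        if hidden.shape[1] > x_pos:                # skip cached single-token decode steps
            hidden[:, x_pos, :] = scale * feature_dir.to(hidden.dtype)
    return steer

def run_with_scale(scale, generate=False):
    handle = model.transformer.h[layer].register_forward_hook(make_steer(scale))
    try:
        with torch.no_grad():
            if generate:
                out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
                return tok.decode(out[0][inputs.input_ids.shape[1]:])
            return model(**inputs, output_hidden_states=True)
    finally:
        handle.remove()

def score_scale(scale):
    out = run_with_scale(scale)
    resid = out.hidden_states[-1][0, x_pos]        # final-layer residual at X
    self_sim = F.cosine_similarity(resid, feature_dir.to(resid.dtype), dim=0).item()
    probs = out.logits[0, -1].softmax(-1)          # next-token distribution
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum().item()
    return self_sim, entropy

# Sweep candidate scales; keep the highest-self-similarity scale whose
# next-token entropy stays below a cutoff (the combination rule is a guess).
candidates = [2.0, 4.0, 8.0, 16.0, 32.0]
scores = {s: score_scale(s) for s in candidates}
ok = [s for s in candidates if scores[s][1] < 4.0] or candidates
best = max(ok, key=lambda s: scores[s][0])

print(best, run_with_scale(best, generate=True))   # self-explanation at the chosen scale
```

In practice the hook would target the layer the SAE was trained on, and `feature_dir` would be the corresponding decoder row rather than a random vector.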
Evaluation shows interpretation accuracy similar to or better than the LLM neuron-explainer method.
Limitation
The model's inherent biases significantly affect the quality of the generated descriptions.
Future Work
As expected, for single-token features (features whose activating tokens are all the same token), the method cannot generate descriptions that explain the token itself. However, this could actually be beneficial, since it helps filter out such context-independent features.
Successful explanations show cosine similarity (self-similarity) above a certain threshold between the final-layer residual vector and the original SAE feature direction, so self-similarity can serve as a failure-detection metric for SAE features.
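Continuing the sketch above (reusing `scores` and `best`), the failure-detection idea could reduce to a simple threshold check; the cutoff value is a placeholder assumption, not a number from the source.

```python
# Hypothetical failure-detection rule: if even the best scale yields low
# self-similarity, treat the self-explanation for this SAE feature as
# unreliable. The 0.5 cutoff is an assumed placeholder.
SELF_SIM_THRESHOLD = 0.5

best_self_sim, _ = scores[best]
if best_self_sim < SELF_SIM_THRESHOLD:
    print("self-similarity too low: likely a failed explanation")
else:
    print("self-similarity check passed")
```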
Related Work
SelfIE for Language Model Embeddings
Patchscopes: Inspecting Hidden Representations
SAE Self-Explanation for SAE Features
InversionView
Separated decoder training for interpretability