Brilliant and cheap method: put a placeholder token X in the prompt and inject the SAE feature direction as a steering vector into the language model's residual stream at that position, so the model generates a self-explanation of the feature.
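A minimal sketch of the idea, not the authors' exact code: hook one transformer layer, overwrite the residual stream at the placeholder token with a scaled feature direction, and decode the model's continuation. The model name (GPT-2), layer index, scale, prompt wording, and the random stand-in for the SAE decoder direction are all assumptions for illustration.

```python
# Minimal sketch (assumptions: GPT-2 via Hugging Face, layer 6, scale 8.0,
# a random stand-in for the real SAE decoder direction W_dec[feature_id]).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The meaning of the word X is"
ids = tok(prompt, return_tensors="pt").input_ids
x_id = tok.encode(" X", add_special_tokens=False)[0]
x_pos = (ids[0] == x_id).nonzero()[0].item()      # position of the placeholder token

layer, scale = 6, 8.0                             # assumed values, to be tuned
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()          # swap in the real SAE feature vector

def patch_residual(module, inputs, output):
    # Overwrite the residual stream at the placeholder position with the
    # scaled feature direction; skip cached single-token decoding steps.
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > x_pos:
        hidden[:, x_pos, :] = scale * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(patch_residual)
try:
    out = model.generate(ids, max_new_tokens=30, do_sample=False)
finally:
    handle.remove()

print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```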
Because explanation quality varies with the scale of the inserted vector, self-similarity and entropy metrics are combined to automatically search for the optimal scale.
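One plausible way to implement that search, hedged because the post's exact combination of the two metrics may differ: for each candidate scale, run a patched forward pass, record the cosine similarity between the final-layer residual at the last position and the feature direction (self-similarity) and the entropy of the next-token distribution, then pick the scale with the highest self-similarity among those whose entropy stays under a cap. The hooking mirrors the sketch above; the candidate scales and entropy cap are assumptions.

```python
# Sketch of the scale search (assumed selection rule: highest self-similarity
# among scales whose next-token entropy stays under a cap).
import torch
import torch.nn.functional as F

@torch.no_grad()
def probe_scale(model, ids, x_pos, layer, direction, scale):
    """One patched forward pass -> (self_similarity, next-token entropy)."""
    captured = {}

    def patch(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, x_pos, :] = scale * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    def grab_final(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["resid"] = hidden[:, -1, :]      # final-layer residual, last position

    h1 = model.transformer.h[layer].register_forward_hook(patch)
    h2 = model.transformer.h[-1].register_forward_hook(grab_final)
    try:
        logits = model(ids).logits[:, -1, :]
    finally:
        h1.remove(); h2.remove()

    probs = F.softmax(logits.float(), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).item()
    self_sim = F.cosine_similarity(captured["resid"].float(),
                                   direction[None, :].float(), dim=-1).item()
    return self_sim, entropy

def search_scale(model, ids, x_pos, layer, direction,
                 candidates=(1, 2, 4, 8, 16, 32, 64), max_entropy=5.0):
    results = {s: probe_scale(model, ids, x_pos, layer, direction, s) for s in candidates}
    ok = [s for s, (_, ent) in results.items() if ent <= max_entropy] or list(candidates)
    return max(ok, key=lambda s: results[s][0])
```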
Verification shows comparable or superior interpretation accuracy relative to the LLM neuron-explainer method.
Limitation
The model's inherent biases significantly affect the quality of the descriptions.
Future Work
As expected, for single-token features (one kind of Activating Tokens), the method cannot generate descriptions that explain the token itself. However, this could actually be beneficial, as it helps filter out such context-independent features.
Successful explanations exceed a certain threshold of cosine similarity (Self-Similarity) between the final-layer residual vector and the original SAE feature direction, which can be used as a failure-detection metric for SAE features.
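A simple failure-detection rule can be built on top of the self-similarity score from the scale-search sketch above; the threshold value here is purely illustrative, not taken from the post.

```python
# Illustrative failure detection (threshold 0.3 is an assumption): if no
# candidate scale reaches the self-similarity threshold, treat the feature's
# self-explanation as unreliable.
def explanation_failed(model, ids, x_pos, layer, direction,
                       candidates=(1, 2, 4, 8, 16, 32, 64), threshold=0.3):
    best = max(probe_scale(model, ids, x_pos, layer, direction, s)[0] for s in candidates)
    return best < threshold
```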
SelfIE for Language Model Embeddings
SelfIE: Self-Interpretation of Large Language Model Embeddings
How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model development.
https://arxiv.org/abs/2403.10949

Patchscopes: Inspecting Hidden Representations
Patchscopes: A Unifying Framework for Inspecting Hidden...
Understanding the internal representations of large language models (LLMs) can help explain models' behavior and verify their alignment with human values.
https://arxiv.org/abs/2401.06102

SAE Self-Explanation for SAE Features
Self-explaining SAE features — LessWrong
TL;DR: We apply the method of SelfIE/Patchscopes to explain SAE features – we give the model a prompt like "What does X mean?" and replace the residual stream at X with the scaled SAE feature vector.
https://www.lesswrong.com/posts/8ev6coxChSWcxCDy8/self-explaining-sae-features
InversionView
Separate decoder training for interpretability

Seonglae Cho