Self-Explanation

Creator: Seonglae Cho
Created: 2025 May 20 17:26
Edited: 2025 Jul 14 13:52
A brilliant and cheap method: insert a placeholder token X into the prompt, add the SAE feature as a steering vector to the language model at that position, and let the model generate a self-explanation of the feature.
Because the quality of explanations varies with the scale of the inserted vector, self-similarity and entropy metrics are combined to search for the optimal scale automatically.
Verification shows similar or superior interpretation accuracy compared to the LLM Neuron explainer method.
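A minimal sketch of the generation step, assuming a TransformerLens-style HookedTransformer on GPT-2 (the prompt template, layer choice, and scale sweep below are illustrative assumptions, not the method's exact settings):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Stand-in for one SAE decoder row: the feature direction to explain.
feature_dir = torch.randn(model.cfg.d_model, device=model.cfg.device)

prompt = 'The meaning of the word "X" is "'
tokens = model.to_tokens(prompt)
# Position of the placeholder ("X" is a single GPT-2 token).
x_pos = (tokens[0] == model.to_single_token("X")).nonzero().item()

def make_steer(direction: torch.Tensor, scale: float):
    def steer(resid, hook):
        # KV-cached generation only sees the full prompt on the first
        # forward pass; later steps have length 1 and are skipped.
        if resid.shape[1] > x_pos:
            resid[:, x_pos, :] += scale * direction / direction.norm()
        return resid
    return steer

def self_explain(direction: torch.Tensor, layer: int, scale: float) -> str:
    """Inject the feature at the placeholder and decode an explanation."""
    hook = (f"blocks.{layer}.hook_resid_post", make_steer(direction, scale))
    with model.hooks(fwd_hooks=[hook]):
        out = model.generate(tokens, max_new_tokens=30, verbose=False)
    return model.to_string(out[0, tokens.shape[1]:])

# Sweep insertion scales; the method scores each candidate explanation
# with a combination of self-similarity and entropy (scoring omitted).
for scale in (2.0, 4.0, 8.0, 16.0):
    print(scale, self_explain(feature_dir, layer=6, scale=scale))
```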

Limitation

The model's inherent bias significantly affects the quality of descriptions.

Future Work

As expected, for a Single-token feature (one kind of Activating Tokens), the method cannot generate descriptions that explain the token itself. However, this could actually be beneficial, as it helps filter out such context-independent tokens.
Successful explanations exceed a certain threshold of cosine similarity (Self-Similarity) between the final-layer residual vector and the original SAE feature, which can be used as a Failure Detection metric for SAE features.
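A minimal sketch of that Failure Detection check, reusing model, tokens, x_pos, make_steer, and feature_dir from the sketch above (the 0.4 cutoff is an illustrative assumption, not a value from the method):

```python
import torch.nn.functional as F

def self_similarity(direction: torch.Tensor, layer: int, scale: float) -> float:
    """Cosine similarity between the final-layer residual at the placeholder
    position and the injected SAE feature direction."""
    hook = (f"blocks.{layer}.hook_resid_post", make_steer(direction, scale))
    with model.hooks(fwd_hooks=[hook]):
        _, cache = model.run_with_cache(tokens)
    final = cache[f"blocks.{model.cfg.n_layers - 1}.hook_resid_post"][0, x_pos]
    return F.cosine_similarity(final, direction, dim=0).item()

# Flag features whose steered run never clears the threshold.
if self_similarity(feature_dir, layer=6, scale=8.0) < 0.4:
    print("Low Self-Similarity: likely a failed explanation for this feature.")
```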

Refs

SelfIE for Language Model Embeddings
Patchscopes: Inspecting Hidden Representations
SAE Self-Explanation for SAE Features
InversionView
separated decoder training for interpretability
