AI Self-Explanation

Creator: Seonglae Cho
Created: 2025 May 20 17:26
Edited: 2025 Nov 12 17:2
A brilliant and cheap method: put a placeholder token X in the prompt and add a steering vector to the language model so that it generates a self-explanation of the corresponding feature.
Because the quality of the explanations varies with the scale of the inserted vector, the method combines self-similarity and entropy metrics to search for the optimal scale automatically.
Verification shows interpretation accuracy similar or superior to the LLM Neuron explainer method.
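A minimal sketch of this setup, assuming TransformerLens and a GPT-2 residual-stream SAE. The prompt, `LAYER`, the random `feature_dir`, and the injection scale are illustrative assumptions, not the post's exact code:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = 'What does " X" mean? It means'
tokens = model.to_tokens(prompt)
# Locate the placeholder token " X" in the prompt.
x_pos = (tokens[0] == model.to_single_token(" X")).nonzero().item()

# Hypothetical SAE feature direction (one decoder row), unit-normalised.
feature_dir = torch.randn(model.cfg.d_model)
feature_dir = feature_dir / feature_dir.norm()
LAYER = 6  # layer the SAE was trained on (assumption)

def make_hook(scale: float):
    def inject(resid, hook):
        # Overwrite the residual stream at X with the scaled feature direction;
        # skip KV-cached decoding steps, which only see the newest token.
        if resid.shape[1] > x_pos:
            resid[:, x_pos, :] = scale * feature_dir.to(resid.device)
        return resid
    return inject

with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post", make_hook(8.0))]):
    out = model.generate(tokens, max_new_tokens=30, do_sample=False)
print(model.to_string(out[0]))
```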

Limitation

The model's inherent bias significantly affects the quality of the descriptions.

Future Work

As expected, for Single-token features (one kind of Activating Tokens), the method cannot generate descriptions that explain the token itself. However, this could actually be beneficial, as it helps filter out such context-independent tokens.
Successful explanations show cosine similarity (Self-Similarity) above a certain threshold between the final-layer residual vector and the original SAE feature direction, which can be used as a Failure Detection metric for SAE features.
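A hedged sketch of how the self-similarity and entropy scoring could be wired up, reusing `model`, `tokens`, `x_pos`, `feature_dir`, `LAYER`, and `make_hook` from the sketch above. The metric combination and threshold are assumptions, not the post's published values:

```python
import torch.nn.functional as F

def score_scale(scale: float):
    """Return (self-similarity, next-token entropy) for one injection scale."""
    with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post", make_hook(scale))]):
        logits, cache = model.run_with_cache(tokens)
    # Self-similarity: cosine similarity between the final-layer residual
    # at the placeholder position and the original SAE feature direction.
    final_resid = cache[f"blocks.{model.cfg.n_layers - 1}.hook_resid_post"][0, x_pos]
    self_sim = F.cosine_similarity(final_resid, feature_dir.to(final_resid.device), dim=0).item()
    # Entropy of the next-token distribution, a rough generation-quality proxy.
    probs = logits[0, -1].softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum().item()
    return self_sim, entropy

scales = [2.0, 4.0, 8.0, 16.0, 32.0]
scored = {s: score_scale(s) for s in scales}
# Hypothetical combination: prefer high self-similarity and low entropy.
best = max(scales, key=lambda s: scored[s][0] - 0.1 * scored[s][1])
# Failure detection: flag the feature if even the best scale stays below
# an (assumed) self-similarity threshold.
is_failed = scored[best][0] < 0.4
```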
 
 
 
SelfIE for Language Model Embeddings
SelfIE: Self-Interpretation of Large Language Model Embeddings
How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We...
Patchscopes: Inspecting Hidden Representations
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Understanding the internal representations of large language models (LLMs) can help explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in...
SAE Self-Explanation for SAE Features
Self-explaining SAE features — LessWrong
TL;DR: We apply the method of SelfIE/Patchscopes to explain SAE features – we give the model a prompt like “What does X mean?”, replace the residual stream…

InversionView

Separate decoder training for interpretability
openreview.net
 
 
