A brilliant and cheap method that puts a placeholder token X in the prompt and adds a steering vector (the SAE feature direction) to the language model's residual stream, so the model generates a self-explanation of the feature.
Because the quality of the explanations varies with the scale of the inserted vector, self-similarity and entropy metrics are combined to automatically search for the optimal scale.
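A minimal sketch of how such a pipeline could look, assuming a GPT-2-style HuggingFace model; this is not the authors' implementation. The prompt wording, insertion layer, candidate scales, entropy cutoff, the choice to overwrite (rather than add to) the residual at the placeholder position, and the random `feature_dir` standing in for a real SAE decoder direction are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Placeholder token X inside a question-style prompt (wording is an assumption).
prompt = 'What does "X" mean? Answer:'
inputs = tok(prompt, return_tensors="pt")
x_pos = max(i for i, t in enumerate(inputs.input_ids[0].tolist())
            if tok.decode([t]).strip() == "X")

layer = 6                                          # hypothetical insertion layer
feature_dir = torch.randn(model.config.n_embd)     # stand-in for an SAE decoder vector
feature_dir = feature_dir / feature_dir.norm()

def make_steer(scale):
    """Forward hook that overwrites the residual stream at X with the scaled feature."""
    def steer(module, hook_in, hook_out):
        hidden = hook_out[0] if isinstance(hook_out, tuple) else hook_out
        if hidden.shape[1] > x_pos:                # skip cached single-token decode steps
            hidden[:, x_pos, :] = scale * feature_dir.to(hidden.dtype)
    return steer

def run_with_scale(scale, generate=False):
    handle = model.transformer.h[layer].register_forward_hook(make_steer(scale))
    try:
        with torch.no_grad():
            if generate:
                out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
                return tok.decode(out[0][inputs.input_ids.shape[1]:])
            return model(**inputs, output_hidden_states=True)
    finally:
        handle.remove()

def score_scale(scale):
    out = run_with_scale(scale)
    resid = out.hidden_states[-1][0, x_pos]        # final-layer residual at X
    self_sim = F.cosine_similarity(resid, feature_dir.to(resid.dtype), dim=0).item()
    probs = out.logits[0, -1].softmax(-1)          # next-token distribution
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum().item()
    return self_sim, entropy

# Sweep candidate scales; keep the highest-self-similarity scale whose
# next-token entropy stays below a cutoff (the combination rule is a guess).
candidates = [2.0, 4.0, 8.0, 16.0, 32.0]
scores = {s: score_scale(s) for s in candidates}
ok = [s for s in candidates if scores[s][1] < 4.0] or candidates
best = max(ok, key=lambda s: scores[s][0])

print(best, run_with_scale(best, generate=True))   # self-explanation at the chosen scale
```

In practice the hook would target the layer the SAE was trained on, and `feature_dir` would be the corresponding decoder row rather than a random vector.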
Evaluation shows interpretation accuracy similar to or better than the LLM neuron-explainer method.
Limitation
The model's inherent biases significantly affect the quality of the generated descriptions.
Future Work
As expected, for single-token features (features whose activating tokens are all the same token), the method cannot generate descriptions that explain the token itself. However, this could actually be beneficial, since it helps filter out such context-independent features.
Successful explanations show cosine similarity (self-similarity) above a certain threshold between the final-layer residual vector and the original SAE feature direction, so self-similarity can serve as a failure-detection metric for SAE features.
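Continuing the sketch above (reusing `scores` and `best`), the failure-detection idea could reduce to a simple threshold check; the cutoff value is a placeholder assumption, not a number from the source.

```python
# Hypothetical failure-detection rule: if even the best scale yields low
# self-similarity, treat the self-explanation for this SAE feature as
# unreliable. The 0.5 cutoff is an assumed placeholder.
SELF_SIM_THRESHOLD = 0.5

best_self_sim, _ = scores[best]
if best_self_sim < SELF_SIM_THRESHOLD:
    print("self-similarity too low: likely a failed explanation")
else:
    print("self-similarity check passed")
```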
Related Work
SelfIE for Language Model Embeddings
Patchscopes: Inspecting Hidden Representations
SAE Self-Explanation for SAE Features
InversionView
Separated decoder training for interpretability