LLM Neuron explainer

Creator: Seonglae Cho
Created: 2024 Sep 16 21:56
Edited: 2025 Jun 20 13:42
Refs
https://openai.com/research/language-models-can-explain-neurons-in-language-models
The directions obtained through sparse coding showed higher interpretability than random directions, PCA, and ICA when used as a neuron basis
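A minimal sketch of how such candidate bases could be extracted for comparison, assuming a matrix of hidden activations; scikit-learn's DictionaryLearning stands in for whatever sparse coder was actually used, and the sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, DictionaryLearning

# Hypothetical (n_samples, d_model) matrix of hidden activations
rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 32))

# Candidate neuron bases: random directions, PCA, ICA, and sparse coding
random_basis = rng.normal(size=(32, 32))
pca_basis = PCA(n_components=32).fit(acts).components_
ica_basis = FastICA(n_components=32, max_iter=1000).fit(acts).components_
sparse_basis = DictionaryLearning(n_components=32, alpha=1.0).fit(acts).components_

# Each basis can then be scored for interpretability, e.g. by asking an
# explainer model to describe the top-activating examples per direction.
```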

2021 MIT

Natural language descriptions of deep visual features

2023 OpenAI

All activations are quantized to integers between 0 and 9 inclusive.
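A sketch of this quantization step, assuming raw per-token activations are clipped at zero and scaled by the neuron's maximum observed activation before being discretized to the 0–9 range shown in the explainer prompt (the scaling choice is an assumption of this sketch).

```python
import numpy as np

def quantize_activations(acts: np.ndarray, max_act: float) -> np.ndarray:
    """Map raw activations to integers 0-9 for the explainer prompt."""
    scaled = np.clip(acts, 0, None) / max(max_act, 1e-8)
    return np.minimum((scaled * 10).astype(int), 9)

# Example: a neuron whose maximum activation over the dataset is 4.0
tokens = ["the", "cat", "sat"]
acts = np.array([0.1, 3.9, 1.7])
print(list(zip(tokens, quantize_activations(acts, max_act=4.0))))
# [('the', 0), ('cat', 9), ('sat', 4)]
```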

2024

Analyze the impact of a feature on model outputs by comparing the baseline output with the output after intervention
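A minimal sketch of this baseline-vs-intervention comparison, assuming a PyTorch model with an HF-style `.logits` output and a hypothetical unit feature direction in the residual stream; the hook rescales the hidden-state component along that direction (scale 0 ablates it) and the impact is measured as KL divergence between output distributions.

```python
import torch
import torch.nn.functional as F

def feature_impact(model, tokens, layer, feature_dir, scale=0.0):
    """Compare baseline logits with logits after intervening on one feature."""
    with torch.no_grad():
        baseline = model(tokens).logits

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        coeff = h @ feature_dir                              # (batch, seq)
        h = h + (scale - 1.0) * coeff.unsqueeze(-1) * feature_dir
        return (h, *output[1:]) if isinstance(output, tuple) else h

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            intervened = model(tokens).logits
    finally:
        handle.remove()

    # KL divergence of the next-token distributions as an impact measure
    return F.kl_div(F.log_softmax(intervened, -1),
                    F.softmax(baseline, -1), reduction="batchmean")
```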
Gradual improvement of the explanation hypothesis via best-of-k sampling, and a smaller explainer model obtained by knowledge distillation (see the sketch below)
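A minimal sketch of best-of-k explanation sampling, assuming hypothetical callables `generate_explanation` (prompts the explainer model) and `score_explanation` (runs the simulator and returns a score such as activation correlation); the distillation step would then fine-tune a smaller explainer on the winning explanations.

```python
def best_of_k_explanation(generate_explanation, score_explanation, k=5):
    """Sample k candidate explanations and keep the highest-scoring one."""
    candidates = [generate_explanation() for _ in range(k)]
    return max(candidates, key=score_explanation)
```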
In order to get feature explanations, Claude 2 is provided with a total of 49 examples: ten examples from the top activations interval; two from each of the other 12 intervals; five completely random examples; and ten examples where the top-activating tokens appear in different contexts. Finally, we ask the model to be succinct in its answer and not provide specific examples of tokens it activates for.
Using the generated explanation, in a new interaction Claude is asked to predict activations for sixty examples: six from the top activations; two from each of the other 12 intervals; ten completely random; and twenty top-activating tokens out of context. For computational efficiency, Claude scores all sixty examples in a single shot, repeating each token followed by its predicted activation. In an ideal setting, each example would be given independently as its own prompt.
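One common way to turn these simulated activations into an explanation score is the correlation between predicted and true activations; a sketch under that assumption, with hypothetical example values:

```python
import numpy as np

def explanation_score(true_acts: np.ndarray, predicted_acts: np.ndarray) -> float:
    """Score an explanation by how well simulated activations match real ones,
    here via Pearson correlation (one common choice)."""
    if np.std(true_acts) == 0 or np.std(predicted_acts) == 0:
        return 0.0
    return float(np.corrcoef(true_acts, predicted_acts)[0, 1])

# Hypothetical example: simulator predicted 0-9 activations for six tokens
true_acts = np.array([0, 0, 7, 9, 1, 0])
predicted = np.array([0, 1, 6, 9, 0, 0])
print(explanation_score(true_acts, predicted))  # high correlation, ~0.98
```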

Transluce


Recommendations