LLM as Neuron explainer

Creator

Seonglae Cho

Created

2024 Sep 16 21:56

Editor

Seonglae Cho

Edited

2024 Oct 24 23:18

Refs

2021 MIT

Natural language descriptions of deep visual features

https://arxiv.org/pdf/2201.11114.pdf

2023

Language models can explain neurons in language models

Methodology: Nick effectively started the project by having the initial idea to have GPT-4 explain neurons, and showing a simple explanation methodology worked. William came up with the initial simulation and scoring methodology and implementation. Dan and Steven ran many experiments resulting in ultimate choices of prompts and explanation/scoring parameters.

https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

https://openai.com/research/language-models-can-explain-neurons-in-language-models
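
The scoring step mentioned above compares activations simulated from an explanation with the neuron's real activations. A minimal sketch of that comparison, assuming a Pearson correlation is used as the score; the function name `correlation_score` and the toy activation lists are hypothetical and not taken from the released code:

```python
import math


def correlation_score(true_acts, simulated_acts):
    """Pearson correlation between real and simulated activations.

    Hypothetical helper: a score near 1.0 means the explanation let the
    simulator predict the neuron well; near 0.0 means it did not.
    """
    if len(true_acts) != len(simulated_acts) or len(true_acts) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(true_acts)
    mean_t = sum(true_acts) / n
    mean_s = sum(simulated_acts) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(true_acts, simulated_acts))
    var_t = sum((t - mean_t) ** 2 for t in true_acts)
    var_s = sum((s - mean_s) ** 2 for s in simulated_acts)
    if var_t == 0 or var_s == 0:
        # A constant sequence carries no signal; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_t * var_s)


# Toy per-token activations (made up for illustration).
real = [0.0, 0.1, 2.3, 0.0, 1.8]
simulated = [0.0, 0.0, 2.0, 0.5, 1.5]
score = correlation_score(real, simulated)
```

In this sketch the simulator's predictions would come from a language model prompted with the explanation; here they are plain lists so the scoring arithmetic can be read on its own.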

2024

intervention scoring

sae-auto-interp

EleutherAI • Updated 2024 Nov 23 8:21

EleutherAI • Updated 2024 Dec 11 6:38

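Compares the model's output in its default state with its output after intervening on a feature, to analyze the effect that feature has on the model's output.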
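
The before/after comparison described above can be sketched as a distance between two next-token distributions. This is only an illustration: using KL divergence as the distance is an assumption, and `intervention_effect` and the toy logits are hypothetical names and values, not the sae-auto-interp API:

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def intervention_effect(baseline_logits, intervened_logits):
    """Hypothetical measure of how much an intervention changed the output.

    Assumption: the effect is summarized as KL(baseline || intervened)
    over the next-token distribution.
    """
    p = softmax(baseline_logits)
    q = softmax(intervened_logits)
    return kl_divergence(p, q)


# Toy logits over a 4-token vocabulary (made up for illustration).
baseline = [2.0, 0.5, 0.1, -1.0]
after_intervention = [0.2, 0.5, 2.5, -1.0]  # the feature pushed token 2 up
effect = intervention_effect(baseline, after_intervention)
```

A larger value means the intervention moved the output distribution further from the baseline; a value of zero means the feature had no measurable effect on this prompt.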

https://huggingface.co/datasets/EleutherAI/auto_interp_explanations/tree/main

https://arxiv.org/pdf/2410.13928

Open Source Automated Interpretability for Sparse Autoencoder Features

Building and evaluating an open-source pipeline for auto-interpretability

https://blog.eleuther.ai/autointerp/

Incremental improvements: best-of-k sampling over candidate hypotheses, and a small explainer model obtained by knowledge distillation

Scaling Automatic Neuron Description | Transluce AI

We are releasing a database of descriptions of every neuron inside Llama-3.1-8B-Instruct, and weights of an explainer model finetuned to produce them. These descriptions have similar quality to a human expert on automated metrics, and can be generated inexpensively using an 8B-parameter model. These high-quality descriptions allow us to query and steer representations in natural language, enabling applications such as our observability interface.

https://transluce.org/neuron-descriptions
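
Best-of-k sampling, as noted for this entry, amounts to drawing several candidate descriptions and keeping the one that scores highest. A minimal sketch under that reading; `best_of_k`, the candidate strings, and the keyword-overlap scorer are all hypothetical stand-ins, since a real pipeline would score each candidate with a simulator:

```python
def best_of_k(candidates, score_fn):
    """Return the highest-scoring candidate and its score.

    Hypothetical helper: `candidates` are k sampled descriptions and
    `score_fn` maps a description to a number (higher is better).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy scorer (assumption): count overlap with words seen in
# top-activating examples. It stands in for a simulator-based score.
top_activating_words = {"paris", "london", "tokyo"}


def toy_score(description):
    return len(top_activating_words & set(description.lower().split()))


samples = [
    "fires on punctuation",
    "fires on city names such as paris london tokyo",
    "fires on paris",
]
best, best_score = best_of_k(samples, toy_score)
```

Ties are resolved in favor of the earliest candidate, because `max` returns the first maximal element it encounters.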
