Single-token feature

Creator: Seonglae Cho
Created: 2025 Jan 30 1:15
Edited: 2025 Feb 26 16:48

A single-token feature is usually a combination of Token-in-context features.
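As a rough operational check, one can measure how much of a feature's activation mass is concentrated on a single input token. The sketch below is illustrative only: it assumes SAE feature activations and the corresponding token ids have already been collected into arrays, and the names `acts`, `token_ids`, and the score definition are assumptions, not from the sources below.

```python
# Minimal sketch (assumed data layout): `acts` holds SAE feature activations
# with shape [n_tokens, n_features]; `token_ids` holds the input token id at
# each position, shape [n_tokens]. For each feature, compute the fraction of
# its total activation mass that falls on its single dominant token.
import numpy as np

def single_token_score(acts: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    n_tokens, n_features = acts.shape
    scores = np.zeros(n_features)
    for f in range(n_features):
        col = acts[:, f]
        active = col > 0
        if not active.any():
            continue
        # Accumulate activation mass per token id, then take the dominant token's share.
        mass = np.zeros(int(token_ids.max()) + 1)
        np.add.at(mass, token_ids[active], col[active])
        scores[f] = mass.max() / mass.sum()
    return scores  # values near 1.0 suggest the feature fires almost only on one token
```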

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
Single-token features are common in early layers; Tokenized SAEs aim to prevent them:
Tokenized SAEs: Disentangling SAE Reconstructions
Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models’ inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. This improvement is achieved by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction. As a result, significantly more interesting features and improved reconstruction in sparse regimes are learned.
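As a concrete illustration of the per-token bias idea described in the abstract above, the sketch below folds a learned per-vocabulary-item baseline into an otherwise standard SAE, so the sparse features only need to explain the context-dependent remainder of the activation. This is a minimal sketch under assumed choices; the class name, encoder form, and exact placement of the bias are illustrative, not the paper's implementation.

```python
# Hedged sketch: an SAE whose reconstruction includes a learned per-token bias.
# The Embedding table stores one baseline reconstruction per vocabulary item,
# so token-identity information is absorbed by the bias rather than by features.
import torch
import torch.nn as nn

class TokenBiasedSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model, bias=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Per-token bias: one learned baseline per vocabulary item (assumed design).
        self.token_bias = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor):
        # x: [batch, d_model] residual-stream activations; token_ids: [batch] input tokens.
        baseline = self.token_bias(token_ids)                 # token-dependent baseline
        f = torch.relu(self.enc(x - baseline - self.b_dec))   # sparse feature activations
        x_hat = self.dec(f) + self.b_dec + baseline           # features explain what's left
        return x_hat, f
```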
 
 
