Each feature mutually increases the other's token probability, creating a feature loop that sometimes breaks the model's capability when no repetition penalty is applied (analogous to the Halting Problem)
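A minimal sketch of how a repetition penalty interrupts such a loop. This is a toy greedy decoder over fixed logits, not the actual feature mechanism; the function names and the penalty value 1.5 are illustrative assumptions (the standard scheme divides positive logits of already-seen tokens by the penalty and multiplies negative ones).

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    # Standard scheme: shrink positive logits of seen tokens,
    # push negative ones further down.
    penalized = list(logits)
    for tok in set(generated_ids):
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized

def greedy_next(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.5, 0.5]   # token 0 dominates: an unbroken loop repeats it forever
history = []
for _ in range(6):
    history.append(greedy_next(apply_repetition_penalty(logits, history)))
print(history)  # → [0, 1, 0, 0, 0, 0]: the penalty diverts the loop at step 2
```

Without the penalty, greedy decoding would emit token 0 on every step; the penalty lowers its logit after the first emission so another token can surface.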
Types
Single node loop

Two-node system

- Unicode prefix/suffix predictors (e.g., Tamil, Chinese scripts)

Complex multi-node Finite State Automata (HTML)
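The two-node case above can be sketched as a toy automaton: each "feature" (state) raises the probability of tokens that activate the other feature, so generation cycles between them. The state names and the HTML-tag tokens here are hypothetical illustrations, not features from the paper.

```python
# Hypothetical two-node loop: each state emits a token that
# activates the other state, producing an endless cycle.
TRANSITIONS = {
    "open_tag_feature":  ("<div>",  "close_tag_feature"),
    "close_tag_feature": ("</div>", "open_tag_feature"),
}

def generate(start, steps):
    state, out = start, []
    for _ in range(steps):
        token, state = TRANSITIONS[state]
        out.append(token)
    return out

print(generate("open_tag_feature", 4))  # → ['<div>', '</div>', '<div>', '</div>']
```

The multi-node HTML case is the same structure with more states, matching the FSA-like behavior described below.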

Finite State Automata
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#phenomenology-fsa

Seonglae Cho