Each feature mutually increases the other's token probability, creating a feature loop that sometimes breaks the model's capability when no repetition penalty is applied (analogous to the Halting Problem)
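A minimal sketch of how a repetition penalty interrupts such a loop. This is a toy greedy decoder over fixed logits, not the actual feature mechanism; the function names and the penalty value 1.5 are illustrative assumptions (the standard scheme divides positive logits of already-seen tokens by the penalty and multiplies negative ones).

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    # Standard scheme: shrink positive logits of seen tokens,
    # push negative ones further down.
    penalized = list(logits)
    for tok in set(generated_ids):
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized

def greedy_next(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.5, 0.5]   # token 0 dominates: an unbroken loop repeats it forever
history = []
for _ in range(6):
    history.append(greedy_next(apply_repetition_penalty(logits, history)))
print(history)  # → [0, 1, 0, 0, 0, 0]: the penalty diverts the loop at step 2
```

Without the penalty, greedy decoding would emit token 0 on every step; the penalty lowers its logit after the first emission so another token can surface.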
Types
Single node loop

Two-node system

- Unicode prefix/suffix predictors (e.g., Tamil, Chinese scripts)

Complex multi-node Finite State Automata (HTML)
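The two-node case above can be sketched as a toy automaton: each "feature" (state) raises the probability of tokens that activate the other feature, so generation cycles between them. The state names and the HTML-tag tokens here are hypothetical illustrations, not features from the paper.

```python
# Hypothetical two-node loop: each state emits a token that
# activates the other state, producing an endless cycle.
TRANSITIONS = {
    "open_tag_feature":  ("<div>",  "close_tag_feature"),
    "close_tag_feature": ("</div>", "open_tag_feature"),
}

def generate(start, steps):
    state, out = start, []
    for _ in range(steps):
        token, state = TRANSITIONS[state]
        out.append(token)
    return out

print(generate("open_tag_feature", 4))  # → ['<div>', '</div>', '<div>', '</div>']
```

The multi-node HTML case is the same structure with more states, matching the FSA-like behavior described below.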

Finite State Automata
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#phenomenology-fsa

Seonglae Cho