
SAE Feature Circuit

Creator
Seonglae Cho
Created
2024 Jun 10 5:24
Editor
Seonglae Cho
Edited
2025 Mar 23 21:27
Refs
AI Circuit

Features can mutually increase the probability of the tokens that activate one another, creating a feature loop that sometimes breaks the model's capability when no repetition penalty is applied (see Halting Problem).
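
A minimal sketch of such a loop, assuming a hypothetical toy model in which the next-token logits depend only on the previous token. The vocabulary, the logit values, and the penalty rule (dividing the positive logits of already generated tokens) are illustrative assumptions, not taken from any real model or library.

```python
# Hypothetical toy "model": token "a" upweights "b" and "b" upweights "a",
# so the two predictors feed each other. "<eos>" is always the runner-up.
VOCAB = ["a", "b", "<eos>"]
LOGITS = {
    "a": {"a": 0.0, "b": 3.0, "<eos>": 2.0},
    "b": {"a": 3.0, "b": 0.0, "<eos>": 2.0},
}


def greedy_decode(start, max_steps=10, repetition_penalty=1.0):
    """Greedy decoding with an optional penalty on tokens already generated."""
    out = [start]
    for _ in range(max_steps):
        logits = dict(LOGITS[out[-1]])
        for tok in set(out):
            # Assumed penalty rule: shrink positive logits, push negative ones lower.
            if logits[tok] > 0:
                logits[tok] /= repetition_penalty
            else:
                logits[tok] *= repetition_penalty
        nxt = max(VOCAB, key=lambda t: logits[t])
        out.append(nxt)
        if nxt == "<eos>":
            break
    return out


looping = greedy_decode("a")                            # never emits "<eos>"
halting = greedy_decode("a", repetition_penalty=2.0)    # leaves the loop
```

With the default penalty of 1.0 the decoder alternates between "a" and "b" until `max_steps` runs out; with a penalty of 2.0 the already seen token drops below "<eos>" and generation stops.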

 

Types

Single node loop
Two-node system
  • Unicode prefix, suffix predictors (Tamil, Chinese)

Complex multi-node Finite State Automata (HTML)

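
The three types above differ mainly in how many features take part in the cycle. As a rough sketch, a circuit can be modeled as a deterministic graph in which each feature points to the feature activated by the token it predicts. The feature names and graphs below are hypothetical placeholders, not features reported in the referenced paper.

```python
def find_loop(transitions, start):
    """Follow feature -> next-feature edges until a feature repeats; return the cycle."""
    path = []
    seen = {}
    node = start
    while node is not None and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = transitions.get(node)
    if node is None:
        return []  # the walk ended without revisiting any feature
    return path[seen[node]:]


def loop_type(loop):
    """Name the loop by the number of features it contains."""
    if not loop:
        return "no loop"
    names = {1: "single node loop", 2: "two-node system"}
    return names.get(len(loop), "complex multi-node")


# Hypothetical feature graphs, one per type.
SINGLE = {"repeat": "repeat"}
TWO = {"prefix": "suffix", "suffix": "prefix"}
HTML = {
    "open_bracket": "tag_name",
    "tag_name": "close_bracket",
    "close_bracket": "open_bracket",
}

for graph, start in [(SINGLE, "repeat"), (TWO, "prefix"), (HTML, "open_bracket")]:
    print(loop_type(find_loop(graph, start)))
```

Real feature circuits are not deterministic; each edge would carry a weight, such as how strongly a feature raises the logit of the token that activates the next feature.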
 
 
 
 

Finite State Automata

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#phenomenology-fsa
 
 

Copyright Seonglae Cho