Sparse Feature Circuit

Creator
Seonglae Cho
Created
2024 Oct 24 11:15
Edited
2026 Feb 5 16:21

SFC

SHIFT (Sparse Human-Interpretable Feature Trimming)
https://arxiv.org/pdf/2403.19647
 
 
 

Task Vector Cleaning
AI Task vector

The Task Vector Cleaning (TVC) algorithm decomposes a "task vector" into a small number of SAE features, isolating the core "execution features" that task performance depends on. The Sparse Feature Circuits (SFC) technique, extended and modified for the Gemma-1 2B model, additionally identifies "task detection features" that are distinct from the execution features, and reveals the causal connection between the two groups (detection→execution). This provides experimental evidence that in-context learning proceeds in two stages, "detecting which task to perform"→"actually executing it", which take place primarily in the MLP and attention sublayers of the middle layers.
While Induction head covers the general pattern-matching capabilities of In-context learning, this paper focuses on Instruction Dataset to explain the causal relationship between task detection and instruction-following execution.
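
The cleaning step described above can be pictured as a sparse reweighting problem. The sketch below is a toy illustration, not the paper's implementation: it assumes a random decoder matrix in place of a trained SAE and a quadratic reconstruction loss in place of a loss measured on the model, then uses non-negative ISTA (proximal gradient descent with an L1 penalty) to keep only a few features. The names `clean_task_vector`, `decoder`, and `l1` are hypothetical.

```python
import numpy as np

def clean_task_vector(task_vector, decoder, l1=0.05, steps=500):
    """Find sparse, non-negative feature weights w with w @ decoder ~= task_vector.

    Minimises 0.5 * ||w @ decoder - task_vector||^2 + l1 * sum(w) subject to w >= 0.
    ASSUMPTION: the quadratic term is a stand-in for a real task loss.
    """
    n_features, _ = decoder.shape
    # Step size 1/L, where L is the largest eigenvalue of decoder @ decoder.T
    lr = 1.0 / np.linalg.norm(decoder, 2) ** 2
    w = np.zeros(n_features)
    for _ in range(steps):
        residual = w @ decoder - task_vector       # reconstruction error, shape (d,)
        grad = decoder @ residual                  # gradient w.r.t. w, shape (n_features,)
        # Proximal step: L1 shrinkage combined with projection onto w >= 0
        w = np.maximum(w - lr * (grad + l1), 0.0)
    return w

# Toy data (ASSUMPTION): a random unit-norm dictionary standing in for SAE decoder rows
rng = np.random.default_rng(0)
decoder = rng.normal(size=(128, 64))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

# Plant two "execution features" (indices 3 and 17) plus a little noise
task_vector = 2.0 * decoder[3] + 1.5 * decoder[17] + 0.005 * rng.normal(size=64)

weights = clean_task_vector(task_vector, decoder)
top_features = np.argsort(weights)[-2:]
print("largest weights on features:", sorted(top_features.tolist()))
```

In this toy setup the two planted features should dominate the recovered weights. With a real model, the reconstruction term would presumably be replaced by a loss that measures task performance when the cleaned vector is used.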

SHIFT

Sparse Human-Interpretable Feature Trimming
First serious attempt at circuit finding with SAEs
Neel Nanda loves this
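
As the name suggests, SHIFT trims features that a human has judged irrelevant to the intended task. A minimal sketch of that trimming step follows, assuming a toy ReLU SAE with identity weights; `sae_encode`, `ablate_features`, and all weight names are hypothetical and do not come from the paper's code. Subtracting only the ablated features' decoder contribution leaves everything else in the activation, including whatever the SAE fails to reconstruct, untouched.

```python
import numpy as np

def sae_encode(x, w_enc, b_enc):
    """Toy ReLU SAE encoder: feature activations for activation vector x."""
    return np.maximum(x @ w_enc + b_enc, 0.0)

def ablate_features(x, w_enc, b_enc, w_dec, ablate_ids):
    """Remove the decoder contribution of the chosen features from x."""
    feats = sae_encode(x, w_enc, b_enc)
    removed = feats[..., ablate_ids] @ w_dec[ablate_ids]
    return x - removed

# Toy SAE (ASSUMPTION): identity encoder/decoder, so feature i is coordinate i
d = 4
w_enc = np.eye(d)
b_enc = np.zeros(d)
w_dec = np.eye(d)

x = np.array([1.0, 0.5, 2.0, 0.0])
# Suppose a human reviewer flagged feature 2 as task-irrelevant
trimmed = ablate_features(x, w_enc, b_enc, w_dec, ablate_ids=[2])
print(trimmed)
```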
Sparse Feature Circuits: Discovering and Editing Interpretable...
We introduce methods for discovering and applying **sparse feature circuits**. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors....
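
To make "causally implicated" concrete: one common way to score a feature's causal effect on a metric is attribution patching, a first-order approximation of what happens when the feature's clean activation is swapped for its value on a contrasting input. This page does not spell out the scoring rule, so the sketch below is an illustration under assumptions: the `metric` function is a toy differentiable stand-in for a model metric, and the threshold and all names are hypothetical.

```python
import numpy as np

def metric(acts, readout):
    """Toy stand-in for a model metric (ASSUMPTION), e.g. a logit difference."""
    return float(np.tanh(acts) @ readout)

def metric_grad(acts, readout):
    """Analytic gradient of the toy metric with respect to feature activations."""
    return (1.0 - np.tanh(acts) ** 2) * readout

def attribution_scores(clean, patch, readout):
    """Linear estimate of each feature's effect: grad(clean) * (patch - clean)."""
    return metric_grad(clean, readout) * (patch - clean)

def exact_effects(clean, patch, readout):
    """Ground truth for comparison: patch one feature at a time and re-evaluate."""
    base = metric(clean, readout)
    effects = np.zeros_like(clean)
    for i in range(clean.size):
        patched = clean.copy()
        patched[i] = patch[i]
        effects[i] = metric(patched, readout) - base
    return effects

clean = np.array([0.1, 0.0, 0.2, 0.0, 0.05])
patch = np.array([0.1, 0.6, 0.2, 0.0, 0.50])
readout = np.array([1.0, 2.0, -1.0, 0.5, 0.1])

scores = attribution_scores(clean, patch, readout)
exact = exact_effects(clean, patch, readout)

# Keep only features whose estimated effect clears a (hypothetical) threshold
threshold = 0.1
circuit_nodes = np.flatnonzero(np.abs(scores) > threshold).tolist()
print("circuit nodes:", circuit_nodes)
```

The linear estimate needs one forward and one backward pass for all features at once, whereas `exact_effects` needs one forward pass per feature, which is why approximations of this kind are attractive when the feature dictionary is large.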
 
 
