SPARSE FEATURE CIRCUITS

Creator: Seonglae Cho
Created: 2024 Oct 24 11:15
Edited: 2025 May 20 17:24

SFC

SHIFT (Sparse Human-Interpretable Feature Trimming)
https://arxiv.org/pdf/2403.19647

Task Vector Cleaning
AI Task vector

The Task Vector Cleaning (TVC) algorithm decomposes a task vector into a small number of SAE features, isolating the core "execution features" that are essential for task performance. The Sparse Feature Circuits (SFC) technique, extended and modified for the Gemma-1 2B model, identifies "task detection features" that are distinct from the execution features and reveals the causal connection between the two groups (detection → execution). This gives experimental evidence that in-context learning proceeds in two stages, first detecting which task to perform and then executing it, and that both stages take place primarily in the MLP and attention sublayers of the middle layers.
While induction heads account for the general pattern-matching capabilities of in-context learning, this paper focuses on instruction datasets to explain the causal relationship between task detection and instruction-following execution.
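
The idea of reducing a task vector to a handful of SAE features can be illustrated with a small sketch. This note does not spell out the objective TVC optimizes, so the code below substitutes a plain sparse-coding objective (reconstruction error plus an L1 penalty, solved with ISTA) and should not be read as the paper's procedure. The function name `clean_task_vector`, the parameters `l1` and `steps`, and the random unit-norm `decoder` standing in for a trained SAE decoder are all hypothetical.

```python
import numpy as np


def clean_task_vector(task_vector, decoder, l1=0.05, steps=1000):
    """Sparse, non-negative weights w such that w @ decoder approximates task_vector.

    Minimizes 0.5 * ||w @ decoder - task_vector||^2 + l1 * sum(w) subject to w >= 0
    using ISTA (proximal gradient descent). This is an assumed stand-in objective.
    """
    # Step size 1/L, where L is the squared largest singular value of the decoder.
    step = 1.0 / np.linalg.norm(decoder, ord=2) ** 2
    w = np.zeros(decoder.shape[0])
    for _ in range(steps):
        residual = w @ decoder - task_vector      # reconstruction error, shape (d_model,)
        w = w - step * (decoder @ residual)       # gradient step on the squared error
        w = np.maximum(w - step * l1, 0.0)        # L1 shrinkage, clipped at zero
    return w


# Toy stand-in for an SAE decoder: random unit-norm feature directions (assumption).
rng = np.random.default_rng(0)
n_features, d_model = 256, 128
decoder = rng.normal(size=(n_features, d_model))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

# Synthetic "task vector" built from two planted features (indices are arbitrary).
task_vector = 2.0 * decoder[3] + 1.5 * decoder[17]

weights = clean_task_vector(task_vector, decoder)
top2 = np.argsort(weights)[-2:]
print("strongest features:", sorted(top2.tolist()))
print("active features:", int(np.count_nonzero(weights)))
```

In this toy setup the two planted features should dominate the recovered weights. Raising `l1` trades reconstruction accuracy for a shorter feature list, which is the same trade-off a cleaning step faces when it keeps only the features that matter for the task.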

SHIFT

Sparse Human-Interpretable Feature Trimming
First serious attempt at circuit finding with SAEs. Neel Nanda loves this.
