SFC
SHIFT (Sparse Human-Interpretable Feature Trimming)
Task Vector Cleaning (TVC)
The Task Vector Cleaning (TVC) algorithm decomposes a "task vector" into a small number of SAE features, isolating the core "execution features" that are essential for task performance. The Sparse Feature Circuits (SFC) technique, extended and adapted to the Gemma-1 2B model, identifies "task detection features" that are distinct from the execution features and reveals the causal connections between the two (detection → execution). Together, these results provide experimental evidence that in-context learning proceeds in two stages, first detecting which task to perform and then executing it, and that this takes place primarily in the MLP and attention sublayers of the middle layers.
While induction heads account for the general pattern-matching capability of in-context learning, this paper focuses on an instruction dataset to explain the causal relationship between task detection and instruction-following execution.
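To make the idea of reducing a task vector to a few SAE features concrete, here is a minimal toy sketch. It is not the paper's TVC procedure: it swaps in plain top-k selection on SAE activations, and the dictionary (`w_enc`, `w_dec`), the dimensions, and the three planted "execution features" are all hypothetical values chosen for illustration.

```python
import numpy as np

def clean_task_vector(task_vector, w_enc, w_dec, k):
    """Toy 'cleaning': keep only the k most active SAE features.

    w_enc, w_dec: (n_features, d_model) hypothetical SAE encoder/decoder weights.
    Returns the sparse feature activations and the reconstructed vector.
    """
    acts = np.maximum(w_enc @ task_vector, 0.0)   # ReLU encoder, bias omitted
    keep = np.argsort(acts)[-k:]                  # indices of the k largest activations
    sparse = np.zeros_like(acts)
    sparse[keep] = acts[keep]
    cleaned = sparse @ w_dec                      # decode back to model space
    return sparse, cleaned

# Toy setup (all values are made up for illustration).
rng = np.random.default_rng(0)
d_model = 16
q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
w_dec = q.T          # rows are orthonormal feature directions
w_enc = w_dec        # for an orthonormal dictionary the encoder equals the decoder

true_coeffs = np.zeros(d_model)
true_coeffs[[2, 7, 11]] = [3.0, 2.0, 1.5]        # three planted "execution features"
task_vector = true_coeffs @ w_dec + 0.05 * rng.normal(size=d_model)

sparse, cleaned = clean_task_vector(task_vector, w_enc, w_dec, k=3)
active = sorted(np.flatnonzero(sparse).tolist())
rel_err = np.linalg.norm(cleaned - task_vector) / np.linalg.norm(task_vector)
print("active features:", active)
print("relative reconstruction error:", round(float(rel_err), 3))
```

In this toy setting the selection criterion is reconstruction of the vector itself. A method aimed at task performance would presumably score features by their effect on the model's task loss instead, which this sketch does not attempt.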
SHIFT
Sparse Human-Interpretable Feature Trimming
First serious attempt at circuit finding with SAEs
Neel Nanda loves this