SFC
SHIFT (Sparse Human-Interpretable Feature Trimming)

Task Vector Cleaning (AI Task vector)
The Task Vector Cleaning (TVC) algorithm decomposes a "task vector" into a small number of SAE features, isolating the core "execution features" that are essential for task performance. The Sparse Feature Circuits (SFC) technique, extended and modified for the Gemma-1 2B model, identifies "task detection features" that are distinct from the execution features and reveals the causal connections between the two groups (detection→execution). This provides experimental evidence that in-context learning proceeds in two stages, "detecting which task to perform"→"actually executing it", which take place primarily in the MLP and attention sublayers of the middle layers.
While induction heads account for the general pattern-matching capability of in-context learning, this paper focuses on an instruction dataset to explain the causal relationship between task detection and instruction-following execution.
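As a rough illustration of the TVC idea above, the sketch below sparsifies a toy task vector with proximal gradient descent: it starts from an SAE encoding of the vector and minimizes a task loss plus an L1 penalty on the feature weights. Everything here is an assumption for illustration, not the paper's implementation: the orthonormal tied-weight `decoder`, the quadratic `toy_task_loss_grad` (a stand-in for the language-model loss on steered prompts), the function name `clean_task_vector`, and the hyperparameters.

```python
import numpy as np

def clean_task_vector(task_vec, decoder, task_loss_grad, l1=0.1, lr=0.5, steps=200):
    """Return sparse, non-negative feature weights whose decoded sum keeps the task loss low.

    decoder: (n_features, d_model) array of SAE decoder directions.
    task_loss_grad: function v -> gradient of the task loss w.r.t. the steering vector v.
    """
    # Initialise from a tied-weight SAE encoding of the raw task vector (assumed encoder).
    w = np.maximum(0.0, decoder @ task_vec)
    for _ in range(steps):
        v = w @ decoder                          # reconstructed steering vector
        grad_w = decoder @ task_loss_grad(v)     # chain rule: dL/dw_i = d_i . dL/dv
        # Proximal step for the L1 penalty combined with a non-negativity constraint.
        w = np.maximum(0.0, w - lr * (grad_w + l1))
    return w

# Toy setup (all hypothetical): 16 orthonormal "SAE features" in a 16-dim space.
rng = np.random.default_rng(0)
d_model = 16
decoder, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))

task_idx = [2, 5, 11]                  # features the toy "model" actually reads
readout = decoder[task_idx]
target = np.ones(len(task_idx))

def toy_task_loss_grad(v):
    # Gradient of 0.5 * ||readout @ v - target||^2, a stand-in for the real task loss.
    return readout.T @ (readout @ v - target)

# A "raw" task vector: every feature is somewhat active, the task features more so.
raw_weights = rng.uniform(0.05, 0.3, size=d_model)
raw_weights[task_idx] += 0.7
raw_task_vec = raw_weights @ decoder

cleaned = clean_task_vector(raw_task_vec, decoder, toy_task_loss_grad)
kept = np.flatnonzero(cleaned)         # indices of the surviving features
```

In this toy setup only the three features that the stand-in loss actually reads survive in `kept`; the L1 term shrinks the weights of all other features to exactly zero.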
https://www.arxiv.org/pdf/2504.13756
SHIFT
Sparse Human-Interpretable Feature Trimming
First serious attempt at circuit finding with SAEs
https://arxiv.org/pdf/2403.19647
Neel Nanda loves this
Sparse Feature Circuits: Discovering and Editing Interpretable...
We introduce methods for discovering and applying **sparse feature circuits**. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors....
https://openreview.net/forum?id=I4e82CIDxv
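A minimal sketch of the trimming step that SHIFT is named for: features that a human judges irrelevant to the intended task are zero-ablated from an activation, leaving everything else (including the SAE's reconstruction error) untouched. The toy tied-weight SAE, the function name `ablate_features`, and the chosen feature indices are hypothetical; this is not the paper's code.

```python
import numpy as np

def ablate_features(x, encoder, enc_bias, decoder, ablate_idx):
    """Remove the contribution of selected SAE features from activation x."""
    feats = np.maximum(0.0, x @ encoder + enc_bias)   # SAE feature activations
    # Subtract only the flagged features' decoder directions, so the rest of
    # the activation is preserved.
    return x - feats[ablate_idx] @ decoder[ablate_idx]

# Toy SAE (assumption): orthonormal decoder rows with a tied encoder and zero bias.
rng = np.random.default_rng(0)
d_model = 8
decoder, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
encoder = decoder.T
enc_bias = np.zeros(d_model)

x = rng.standard_normal(d_model)
spurious = [1, 4]   # hypothetical features a human flagged as task-irrelevant
x_trimmed = ablate_features(x, encoder, enc_bias, decoder, spurious)

# Re-encode to check which features are still active after trimming.
before = np.maximum(0.0, x @ encoder + enc_bias)
after = np.maximum(0.0, x_trimmed @ encoder + enc_bias)
```

Comparing `before` and `after` shows the flagged features at zero and the remaining features unchanged, which is the property the trimming relies on in this idealized orthonormal case.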


Seonglae Cho