SAE Match

Creator

Seonglae Cho

Created

2024 Oct 24 11:30

Editor

Seonglae Cho

Edited

2024 Oct 24 11:34

Refs

Residual Stream

Match Features Across Layers

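In most cases, features in different layers are the same features, merely reordered or slightly modified.

The closer two layers are, the more features are preserved between them, and the larger the share that is only reordered.

As the layers get farther apart, however, the matching MSE rises and genuinely new features appear.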

Mechanistic Permutability

Mechanistic Permutability: Match Features Across Layers

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
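The matching step described in the abstract can be sketched as a linear assignment problem. This is a minimal sketch and not the paper's code. It assumes each SAE exposes a decoder matrix of shape (n_features, d_model) and a per-feature activation threshold, that "folding" means scaling each decoder row by its threshold, and that only decoder weights are compared. The names `fold_decoder` and `match_features` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fold_decoder(w_dec, theta):
    """Scale each feature's decoder row by its threshold (assumed folding rule)."""
    return w_dec * theta[:, None]


def match_features(w_dec_a, theta_a, w_dec_b, theta_b):
    """Match features of SAE A to features of SAE B by minimizing total squared error.

    Returns (cols, mse), where cols[i] is the index of the feature in B
    assigned to feature i in A.
    """
    a = fold_decoder(w_dec_a, theta_a)
    b = fold_decoder(w_dec_b, theta_b)
    # cost[i, j] = squared Euclidean distance between folded rows a[i] and b[j]
    cost = (a ** 2).sum(axis=1)[:, None] - 2.0 * (a @ b.T) + (b ** 2).sum(axis=1)[None, :]
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    mse = cost[rows, cols].sum() / a.size
    return cols, mse


# Toy data: "layer B" is a shuffled, slightly perturbed copy of "layer A".
rng = np.random.default_rng(0)
n_features, d_model = 16, 8
w_a = rng.normal(size=(n_features, d_model))
theta_a = rng.uniform(0.5, 1.5, size=n_features)

perm = rng.permutation(n_features)  # feature j in B came from feature perm[j] in A
w_b = w_a[perm] + 0.01 * rng.normal(size=(n_features, d_model))
theta_b = theta_a[perm]

matched, mse = match_features(w_a, theta_a, w_b, theta_b)
print("recovered the shuffle:", np.array_equal(matched, np.argsort(perm)))
print("matching MSE:", mse)
```

On this toy data the assignment recovers the hidden shuffle because B differs from A only by reordering and small noise. This mirrors the note above: nearby layers match with low MSE, and a rising MSE signals features that have no good counterpart.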

https://arxiv.org/html/2410.07656v2
