SAE Feature Matching

Creator: Seonglae Cho
Created: 2024 Oct 24 11:30
Edited: 2025 Feb 24 1:14

Match Features Across Layers

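  • In most cases, features in different layers are the same features, only reordered or slightly transformed.
  • The closer two layers are, the more identical features are preserved, and a large share of them differ only by permutation.
  • As the layers get farther apart, however, the MSE rises and genuinely new features clearly appear.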

Methods

  • Activation Correlating - high cost
  • Weight Cosine Similarity - low cost
 
 

Mechanistic Permutability

Decoder Weight MSE
arxiv.org
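
The permutation view above can be sketched as an assignment problem: treat each decoder row as a feature direction, compute the squared distance between every pair of features in two layers, and pick the one-to-one matching with the lowest total cost. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `match_by_decoder_mse`, the `(n_features, d_model)` weight layout, and the toy data are all hypothetical, and SciPy's `linear_sum_assignment` is used here as one convenient solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_by_decoder_mse(dec_a: np.ndarray, dec_b: np.ndarray):
    """Find the one-to-one matching of layer-B features to layer-A features.

    dec_a, dec_b: decoder matrices of shape (n_features, d_model),
    one row per feature (assumed layout).
    """
    # Pairwise squared distance: |a_i - b_j|^2 = |a_i|^2 + |b_j|^2 - 2 a_i . b_j
    sq_a = (dec_a ** 2).sum(axis=1)[:, None]
    sq_b = (dec_b ** 2).sum(axis=1)[None, :]
    cost = sq_a + sq_b - 2.0 * dec_a @ dec_b.T
    # Linear assignment: one-to-one matching that minimizes the total cost.
    rows, cols = linear_sum_assignment(cost)
    mse = cost[rows, cols].sum() / dec_a.size
    return cols, mse


rng = np.random.default_rng(0)
dec_a = rng.normal(size=(64, 16))
true_perm = rng.permutation(64)
# Toy "next layer": the same features, shuffled and slightly perturbed.
dec_b = dec_a[true_perm] + 0.01 * rng.normal(size=(64, 16))
perm, mse = match_by_decoder_mse(dec_a, dec_b)
print(f"MSE after matching: {mse:.6f}")
```

Here `perm[i]` is the layer-B index matched to layer-A feature `i`. A low MSE after matching corresponds to the "only the order changed" case, while a high residual MSE points to features that have no good counterpart.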

Evolution through layers

Activation Correlating
  • Features contain more specialized information as they progress to higher layers.
  • Some features emerge as logical (AND/OR) combinations of previous-layer features.
  • Some features are simply passed through layers unchanged (an inefficiency).
Evolution of SAE Features Across Layers in LLMs
Sparse Autoencoders for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors, and build communities of related features across layers. We find that a considerable amount of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.
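
Activation correlation can be illustrated with a small sketch: record SAE activations for two layers on the same tokens, then compute the Pearson correlation between every pair of features. The function name `correlate_features`, the `(n_tokens, n_features)` layout, and the toy pass-through, AND-like, and OR-like features are assumptions made for illustration only. The cost is high because every token has to be run through the model and both SAEs.

```python
import numpy as np


def correlate_features(acts_a: np.ndarray, acts_b: np.ndarray, eps: float = 1e-8):
    """Pearson correlation between every layer-A and layer-B feature.

    acts_a: (n_tokens, n_features_a), acts_b: (n_tokens, n_features_b),
    SAE activations recorded on the same tokens (assumed layout).
    """
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + eps)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + eps)
    return za.T @ zb / acts_a.shape[0]


rng = np.random.default_rng(0)
# Toy sparse activations: three layer-A features over 5000 tokens.
acts_a = rng.random((5000, 3)) * (rng.random((5000, 3)) < 0.2)
acts_b = np.stack(
    [
        acts_a[:, 0],                            # passed through unchanged
        np.minimum(acts_a[:, 1], acts_a[:, 2]),  # AND-like: fires only if both fire
        np.maximum(acts_a[:, 1], acts_a[:, 2]),  # OR-like: fires if either fires
    ],
    axis=1,
)
corr = correlate_features(acts_a, acts_b)
# For each layer-B feature, the most correlated layer-A feature.
best_predecessor = corr.argmax(axis=0)
```

In this toy setup the pass-through feature correlates almost perfectly with its predecessor, while the AND-like and OR-like features correlate with both of their inputs at once.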

SAE feature flow along the layers

Weight Cosine Similarity
arxiv.org
Cross-Layer Feature Alignment and Steering in Large Language Model — LessWrong
Each layer of a LLM encodes its own set of learned concepts. By utilizing SAE features, we can uncover their lifespan and examine how information processing unfolds throughout the model. This approach enhances our understanding of LLMs and improves our ability to control them.
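
A weight-based alternative needs no forward passes: compare decoder directions directly and follow a feature from layer to layer through its nearest neighbor. The sketch below is hypothetical; `decoder_cosine_similarity`, `trace_feature`, the `(n_features, d_model)` layout, and the toy decoders are assumptions, not code from the linked post.

```python
import numpy as np


def decoder_cosine_similarity(dec_a: np.ndarray, dec_b: np.ndarray, eps: float = 1e-8):
    """Cosine similarity between every pair of decoder directions.

    dec_a: (n_features_a, d_model), dec_b: (n_features_b, d_model).
    """
    unit_a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + eps)
    unit_b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + eps)
    return unit_a @ unit_b.T


def trace_feature(decoders, start_index):
    """Follow one feature layer by layer via its nearest next-layer neighbor."""
    path, sims = [start_index], []
    for dec_a, dec_b in zip(decoders[:-1], decoders[1:]):
        sim = decoder_cosine_similarity(dec_a[path[-1]][None, :], dec_b)[0]
        path.append(int(sim.argmax()))
        sims.append(float(sim.max()))
    return path, sims


rng = np.random.default_rng(0)
base = rng.normal(size=(32, 16))
# Three toy "layers": the same directions, reshuffled with small drift.
perms = [np.arange(32), rng.permutation(32), rng.permutation(32)]
decoders = [base[p] + 0.05 * rng.normal(size=(32, 16)) for p in perms]
path, sims = trace_feature(decoders, start_index=0)
```

`path` lists the index of the traced feature in each layer, and `sims` gives the cosine similarity of each hop. A drop in similarity along the path would suggest that the feature has changed or ended at that layer.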

Depends on seed and dataset (SAE Training)

Weight Cosine Similarity
Orphan features still show high interpretability, which suggests that each seed may have found only a subset of the dictionary at the “idealized dictionary size”.
  • Dataset matters more than seed.
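
One way to make the seed comparison concrete is to split the features of one SAE into shared and orphan sets by their best cosine match in another SAE. This is only a sketch: "orphan" is taken here to mean a feature with no close counterpart in the other run, and the `shared_and_orphan` name, the 0.7 threshold, and the toy weights are all assumptions for illustration.

```python
import numpy as np


def shared_and_orphan(dec_a: np.ndarray, dec_b: np.ndarray, threshold: float = 0.7):
    """Split the features of SAE A into shared and orphan sets relative to SAE B.

    dec_a, dec_b: decoder matrices of shape (n_features, d_model) (assumed layout).
    threshold: minimum best-match cosine similarity to count as shared (assumed value).
    """
    unit_a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    unit_b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    # Best cosine similarity of each A feature against all B features.
    best = (unit_a @ unit_b.T).max(axis=1)
    shared = np.flatnonzero(best >= threshold)
    orphan = np.flatnonzero(best < threshold)
    return shared, orphan


rng = np.random.default_rng(0)
common = rng.normal(size=(40, 64))
# Two toy "seeds": 40 directions in common plus 10 unique to each run.
sae_a = np.vstack([common, rng.normal(size=(10, 64))])
sae_b = np.vstack([common + 0.05 * rng.normal(size=(40, 64)), rng.normal(size=(10, 64))])
shared, orphan = shared_and_orphan(sae_a, sae_b)
print(f"shared: {len(shared)}, orphan: {len(orphan)}")
```

Running the same comparison for two SAEs that differ only in seed and for two that differ only in dataset would give a rough measure of which factor changes the learned dictionary more.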
 
 
 
 
