SAE Feature Matching

Match Features Across Layers

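In most cases, features in different layers are the same features, only permuted or slightly transformed.

The closer two layers are, the more identical features are preserved, and the larger the share of features that are merely permuted.

As the layers get further apart, however, the MSE rises and genuinely new features clearly appear.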

Methods

Activation Correlating - high cost

Weight Cosine Similarity - low cost
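
The cost difference between the two methods can be seen in a minimal sketch. Activation correlation needs SAE activations collected on a shared set of tokens (a forward pass over a dataset), while weight cosine similarity only needs the two decoder matrices. The function names, the array layouts (`n_tokens x n_features` for activations, `n_features x d_model` for decoders) and the `eps` guard are assumptions made for illustration, not taken from any of the linked papers.

```python
import numpy as np

def activation_correlation(acts_a, acts_b, eps=1e-8):
    """Pearson correlation between every pair of features.

    acts_a: (n_tokens, n_features_a) SAE activations at one layer
    acts_b: (n_tokens, n_features_b) SAE activations at another layer,
            computed on the same tokens (assumed layout)
    Returns an (n_features_a, n_features_b) matrix.
    """
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    return a.T @ b

def weight_cosine_similarity(dec_a, dec_b, eps=1e-8):
    """Cosine similarity between decoder directions.

    dec_a, dec_b: (n_features, d_model), one row per feature (assumed layout)
    Returns an (n_features_a, n_features_b) matrix.
    """
    a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + eps)
    b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + eps)
    return a @ b.T
```

The first function scales with the number of tokens used, which is where the high cost comes from; the second is a single matrix product over the weights.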

Mechanistic Permutability

Decoder Weight MSE

https://arxiv.org/pdf/2410.07656v2
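
A rough sketch of matching by decoder weight MSE: build the pairwise MSE between decoder rows of two layers and pick the permutation that minimizes the total. It assumes both SAEs have the same number of features and uses SciPy's `linear_sum_assignment` as the solver; the function name and array layout are hypothetical, and the sketch omits any preprocessing of the weights that the linked paper may apply.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_by_decoder_mse(dec_a, dec_b):
    """Find the permutation of layer-B features that best matches layer A.

    dec_a, dec_b: (n_features, d_model) decoder matrices (assumed layout,
                  same n_features in both)
    Returns (perm, mean_mse) where dec_b[perm[i]] is matched to dec_a[i].
    """
    # Pairwise squared distance via ||a||^2 + ||b||^2 - 2 a.b,
    # divided by d_model to get a mean squared error per pair.
    sq_a = (dec_a ** 2).sum(axis=1)[:, None]
    sq_b = (dec_b ** 2).sum(axis=1)[None, :]
    cost = (sq_a + sq_b - 2.0 * dec_a @ dec_b.T) / dec_a.shape[1]

    # Optimal 1:1 assignment minimizing the summed MSE.
    rows, cols = linear_sum_assignment(cost)
    return cols, float(cost[rows, cols].mean())
```

A low `mean_mse` between nearby layers and a higher one between distant layers would be consistent with the observation above that new features appear as layers get further apart.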

Evolution through layers

Activation Correlating

Features contain more specialized information as they progress to higher layers.

Features emerge as logical combinations (AND/OR).

Some features are simply passed through layers (inefficiency)

Evolution of SAE Features Across Layers in LLMs

Sparse Autoencoders for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors, and build communities of related features across layers. We find that a considerable amount of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.

https://arxiv.org/html/2410.08869v2
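
One way to probe the AND/OR and pass-through relationships noted above is to binarize activations and compare firing patterns. This is only an illustrative sketch: the binarization threshold, the Jaccard score and the function name are assumptions, and the linked paper's actual analysis may differ.

```python
import numpy as np

def boolean_fit(prev_a, prev_b, nxt, threshold=0.0):
    """Score how well a next-layer feature's firing pattern matches
    simple combinations of two previous-layer features.

    prev_a, prev_b, nxt: (n_tokens,) activation vectors on the same tokens
    Returns Jaccard overlaps for AND, OR, and plain pass-through.
    """
    a = prev_a > threshold
    b = prev_b > threshold
    t = nxt > threshold

    def jaccard(x, y):
        union = np.logical_or(x, y).sum()
        return float(np.logical_and(x, y).sum() / union) if union else 0.0

    return {
        "AND": jaccard(a & b, t),
        "OR": jaccard(a | b, t),
        "pass_a": jaccard(a, t),
        "pass_b": jaccard(b, t),
    }
```

The highest-scoring key gives a crude label for the relationship; a score near 1 for `pass_a` or `pass_b` would correspond to a feature that is simply passed through.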

SAE feature flow along the layers

Weight Cosine Similarity

https://arxiv.org/pdf/2502.03032

Cross-Layer Feature Alignment and Steering in Large Language Model — LessWrong

Each layer of a LLM encodes its own set of learned concepts. By utilizing SAE features, we can uncover their lifespan and examine how information processing unfolds throughout the model. This approach enhances our understanding of LLMs and improves our ability to control them.

https://www.lesswrong.com/posts/feknAa3hQgLG2ZAna/cross-layer-feature-alignment-and-steering-in-large-language-2
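
The idea of following a feature's lifespan through the model can be sketched with weight cosine similarity alone: start from one feature and repeatedly jump to its most similar decoder direction in the next layer, stopping when no close successor exists. The function name, the `min_sim` cutoff of 0.5 and the array layout are assumptions for illustration, not values from the linked work.

```python
import numpy as np

def trace_feature(decoders, start_layer, start_feature, min_sim=0.5):
    """Greedily follow one feature through consecutive layers.

    decoders: list of (n_features, d_model) arrays, one per layer
              (assumed layout, shared d_model)
    Returns a list of (layer, feature_index, similarity_to_previous).
    """
    path = [(start_layer, start_feature, 1.0)]
    vec = decoders[start_layer][start_feature]
    for layer in range(start_layer + 1, len(decoders)):
        dec = decoders[layer]
        # Cosine similarity of the current direction to every next-layer feature.
        sims = dec @ vec / (np.linalg.norm(dec, axis=1) * np.linalg.norm(vec) + 1e-8)
        best = int(sims.argmax())
        if sims[best] < min_sim:
            break  # no sufficiently similar successor: the feature "ends" here
        path.append((layer, best, float(sims[best])))
        vec = dec[best]
    return path
```

The length of the returned path is a rough proxy for how many layers the feature survives.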

Depends on seed and dataset (SAE Training)

Weight Cosine Similarity

Orphan features still show high interpretability, which suggests that each seed may have found a different subset of the “idealized dictionary size”.

https://anonymous.4open.science/r/sae_overlap-51F1/README.md

seed - weight initialization matters → SAE weight initialization helps to prevent this issue

Weight cosine similarity + Hungarian Matching

The cost matrix is set to 1 - (cosine similarity matrix), and Hungarian Matching is applied to it to find the optimal 1:1 matching.

https://arxiv.org/pdf/2501.16615
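
The matching step described above can be sketched as follows, using SciPy's `linear_sum_assignment` for the Hungarian step. The function name and the `shared_threshold` of 0.7 for separating shared from orphan features are assumptions for illustration, not values confirmed by the linked paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(dec_a, dec_b, shared_threshold=0.7, eps=1e-8):
    """1:1 matching of features from two SAEs (e.g. different seeds).

    dec_a, dec_b: (n_features, d_model) decoder matrices (assumed layout)
    Returns (shared, orphans_a):
      shared    - list of (index_in_a, index_in_b) pairs above the threshold
      orphans_a - indices in A whose best 1:1 partner is below the threshold
    """
    a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + eps)
    b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + eps)
    cos = a @ b.T

    # Cost = 1 - cosine similarity; the solver minimizes the total cost.
    rows, cols = linear_sum_assignment(1.0 - cos)
    sims = cos[rows, cols]

    shared = [(int(r), int(c)) for r, c, s in zip(rows, cols, sims)
              if s >= shared_threshold]
    orphans_a = [int(r) for r, s in zip(rows, sims) if s < shared_threshold]
    return shared, orphans_a
```

Running this on SAEs trained with different seeds, and again on SAEs trained on different datasets, gives a simple way to compare how much each factor changes the learned dictionary.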

dataset - matters more than seed
