Match Features Across Layers
- In most cases, features between layers are only reordered or slightly transformed
- The closer two layers are, the more identical features are preserved, and a higher proportion are merely permuted
- But as layers get farther apart, the MSE increases and genuinely new features appear.
Methods
- Activation Correlating - high cost
- Weight Cosine Similarity - low cost
- Mechanistic Permutability
- Decoder Weight MSE
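Both weight-based scores can be computed directly from the decoder matrices, with no forward pass. A minimal sketch, assuming two adjacent-layer SAEs with decoder matrices of shape (d_model, n_features); the function and variable names are hypothetical:

```python
import torch

def decoder_cosine(W_a: torch.Tensor, W_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every pair of decoder directions.

    W_a, W_b: decoder matrices of shape (d_model, n_features),
    one direction per feature (column). Returns (n_a, n_b).
    """
    A = W_a / W_a.norm(dim=0, keepdim=True)
    B = W_b / W_b.norm(dim=0, keepdim=True)
    return A.T @ B

def decoder_mse(W_a: torch.Tensor, W_b: torch.Tensor) -> torch.Tensor:
    """Mean squared error between every pair of decoder directions
    (squared L2 distance averaged over d_model)."""
    d_model = W_a.shape[0]
    return torch.cdist(W_a.T, W_b.T).pow(2) / d_model
```

Cosine similarity ignores feature magnitude, while the MSE variant also penalizes differences in norm.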
Evolution through layers
Activation Correlating
- Features contain more specialized information as they progress to higher layers.
- Features emerge as logical combinations (AND/OR).
- Some features are simply passed through layers unchanged (an inefficiency)
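Activation correlating is the high-cost option because it requires feature activations collected over a corpus. A minimal sketch, assuming precomputed activation matrices of shape (n_tokens, n_features) for two adjacent layers; the names are hypothetical:

```python
import torch

def activation_correlation(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between every feature pair of two SAEs.

    acts_a, acts_b: (n_tokens, n_features) SAE feature activations
    collected over the same token stream. Returns (n_a, n_b).
    """
    a = acts_a - acts_a.mean(dim=0)
    b = acts_b - acts_b.mean(dim=0)
    a = a / (a.norm(dim=0) + 1e-8)  # epsilon guards dead (all-zero) features
    b = b / (b.norm(dim=0) + 1e-8)
    return a.T @ b
```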
Evolution of SAE Features Across Layers in LLMs
Sparse Autoencoders for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors, and build communities of related features across layers. We find that a considerable amount of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.
https://arxiv.org/html/2410.08869v2
SAE feature flow across layers
Weight Cosine Similarity
Cross-Layer Feature Alignment and Steering in Large Language Model — LessWrong
Each layer of a LLM encodes its own set of learned concepts. By utilizing SAE features, we can uncover their lifespan and examine how information processing unfolds throughout the model. This approach enhances our understanding of LLMs and improves our ability to control them.
https://www.lesswrong.com/posts/feknAa3hQgLG2ZAna/cross-layer-feature-alignment-and-steering-in-large-language-2
Depends on seed and dataset (SAE Training)
Weight Cosine Similarity
Orphan features still show high interpretability, which suggests that a different seed may have found a different subset of the "idealized dictionary".
- seed - weight initialization matters → careful SAE weight initialization helps mitigate this issue
- Weight cosine similarity + Hungarian Matching
It uses 1 - cosine similarity as the cost matrix and applies the Hungarian algorithm to find an optimal 1:1 matching (see the sketch after this list)
- dataset - matters more than seed
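A minimal sketch of the matching step above, using SciPy's linear_sum_assignment (Hungarian algorithm); the decoder shapes and names are assumptions:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_features(W_a: torch.Tensor, W_b: torch.Tensor):
    """Optimal 1:1 feature matching between two SAEs (e.g. different seeds).

    Builds a 1 - cosine similarity cost matrix over decoder directions and
    finds the minimum-cost one-to-one assignment.
    """
    A = (W_a / W_a.norm(dim=0, keepdim=True)).T.cpu().numpy()  # (n_a, d_model)
    B = (W_b / W_b.norm(dim=0, keepdim=True)).T.cpu().numpy()  # (n_b, d_model)
    sim = A @ B.T                                   # pairwise cosine similarity
    rows, cols = linear_sum_assignment(1.0 - sim)   # minimize total cost
    return rows, cols, sim[rows, cols]              # matched pairs + their scores
```

Pairs with low matched similarity under this assignment are the orphan features discussed above.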

Seonglae Cho