Visual Analysis of Residual SAEs: Activation Values and Feature Matching
Abstract
Sparse Autoencoders (SAEs) linearly disentangle interpretable features from a large language model's intermediate representations. However, the basic dynamics of SAEs—such as the activation values of SAE features and the encoder and decoder weights—have not been as extensively visualized as their implications. To shed light on the properties of feature activation values and the emergence of SAE features, I conducted a two-part visual analysis: (1) an analysis of SAE feature activations across token positions in comparison with other layers, and (2) a feature matching analysis across different SAEs based on decoder weights under diverse training settings. The first analysis revealed intriguing traits related to token positions and positional embeddings. The second analysis initially identified differences between encoder and decoder weights in feature matching, and examined the relative importance of factors such as the dataset, seed, SAE type, and dictionary size, all of which contribute to distinctive features across layers.
Introduction
The Sparse Autoencoder (SAE) architecture, introduced by Faruqui et al., has demonstrated the capacity to decompose interpretable features in a linear fashion (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023). SAE latent dimensions can be interpreted as monosemantic features by disentangling superpositioned neuron activations from the LLM's linear activations. This approach offers broad interpretability by reconstructing transformer residual streams (Gao et al., 2024), MLP activations (Bricken et al., 2023), and even dense word embeddings (O'Neill et al., 2024). Following its demonstrated efficiency, further exploration has uncovered novel architectures incorporating various activation functions (such as Top-K) and proposals for multi-level feature SAEs, including the Matryoshka SAE (Nabeshima, 2024; Bussmann, 2024).
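To make the encode, sparsify, decode pipeline concrete, here is a minimal sketch of a Top-K SAE forward pass in dependency-free Python. The function name `topk_sae_forward`, the toy sizes, the random weights, and the choice to clamp the kept activations at zero are all assumptions made for illustration; they are not taken from any of the cited implementations.

```python
import random

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x into at most k sparse feature activations, then reconstruct it.

    W_enc and W_dec are lists of n_features rows, each of length d_model.
    """
    # Encoder pre-activation: one scalar per dictionary feature.
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    # Top-K sparsity: keep the k largest pre-activations, zero out the rest.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    # Clamping kept values at zero is an assumption of this sketch.
    acts = [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]
    # Decoder: the reconstruction is a weighted sum of decoder directions.
    recon = list(b_dec)
    for a, direction in zip(acts, W_dec):
        if a > 0.0:
            recon = [r + a * d for r, d in zip(recon, direction)]
    return acts, recon

random.seed(0)
d_model, n_features, k = 4, 16, 3  # toy sizes, chosen only for illustration
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

acts, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active = [i for i, a in enumerate(acts) if a > 0.0]
print(len(active), len(recon))
```

Each row of `W_dec` plays the role of one feature direction in the model's activation space, which is why decoder rows are a natural object to compare across SAEs.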
…
Preliminaries
1.1 Mechanistic Interpretability
Mechanistic Interpretability seeks to reverse-engineer neural networks by analyzing their internal mechanisms and intermediate representations (Neel, 2021; Olah, 2022). This approach typically focuses on analyzing latent dimensions, leading to discoveries such as layer pattern features in CNN-based vision models (Olah et al., 2017; Carter et al., 2019) and neuron-level features (Schubert et al., 2021; Goh et al., 2021). The success of the attention mechanism (Bahdanau et al., 2014; Parikh et al., 2016) and the Transformer model (Vaswani et al., 2017) has further spurred efforts to understand the emergent abilities of transformers (Wei, 2022).
1.2 Residual Stream
In transformer architectures, the residual stream—as described in Elhage et al.—is a continuous flow of fixed-dimensional vectors connected via residual connections. It serves as a communication channel between layers and attention heads (Elhage et al., 2021), making it a focal point of research on transformer capabilities (Olsson et al., 2022; Riggs, 2023).
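The additive structure of the residual stream can be sketched in a few lines. This is a simplified illustration under stated assumptions: it tracks a single token position, omits layer normalization, and uses hypothetical stand-in functions (`toy_attn`, `toy_mlp`) instead of real sublayers, so it shows only how each component reads the stream and adds its output back.

```python
def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def run_residual_stream(tok_emb, pos_emb, blocks):
    """Return the stream state after the embeddings and after each block."""
    # The stream starts as token embedding plus positional embedding.
    stream = add(tok_emb, pos_emb)
    states = [stream]
    for attn, mlp in blocks:
        stream = add(stream, attn(stream))  # attention writes into the stream
        stream = add(stream, mlp(stream))   # MLP reads the updated stream, writes back
        states.append(stream)
    return states

# Hypothetical stand-ins for real sublayers (assumptions for illustration).
def toy_attn(s):
    return [0.1 * v for v in s]

def toy_mlp(s):
    return [1.0 for _ in s]

states = run_residual_stream([1.0, 2.0], [0.5, -0.5], [(toy_attn, toy_mlp)] * 2)
print(len(states))  # one state per layer boundary, including the embedding output
```

A residual SAE is trained on one of these per-layer states, so the positional embedding added at the start remains part of every later state unless a block cancels it.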
1.3 Superposition Hypothesis
In neural network representations, the superposition of thought vectors (Goh, 2023) and word embeddings (Arora et al., 2018) has given rise to the superposition hypothesis. Using toy models, Elhage et al. detailed the emergence of the superposition hypothesis through the process of phase change in feature dimensionality, linking it to compressed sensing (Donoho, 2006; Bora et al., 2017). Additionally, activations in transformers are empirically found to be highly superpositioned (Gurnee et al., 2023). While this superposition effectively explains the operation of LLMs, its linearity remains a controversial topic (Mendel, 2024).
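A toy example helps show what superposition means geometrically. The sketch below packs three features into two dimensions as unit vectors 120 degrees apart; the layout and the helper names `embed` and `read_out` are assumptions chosen for illustration, not the setup used in the cited toy-model work.

```python
import math

# Three features packed into two dimensions as unit vectors 120 degrees apart.
angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
directions = [(math.cos(a), math.sin(a)) for a in angles]

def embed(feature_values):
    """Superpose features: sum each direction scaled by its feature value."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)

def read_out(vec):
    """Linear readout: project the vector onto every feature direction."""
    return [vec[0] * d[0] + vec[1] * d[1] for d in directions]

# Activate only the first feature and read all three back.
readout = [round(r, 3) for r in read_out(embed([1.0, 0.0, 0.0]))]
print(readout)  # [1.0, -0.5, -0.5]
```

The active feature is recovered exactly, while the two inactive features pick up interference of -0.5. When features are sparse, a nonlinearity such as ReLU can suppress this negative interference, which is the intuition behind storing more features than dimensions.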
1.4 Linear Representation Hypothesis
In the vector representation space of neural networks, it is posited that neural networks exhibit linear directions in activation space (Mikolov et al., 2013). This has led to studies demonstrating that word embeddings reside in interpretable linear subspaces (Park et al., 2023) and that LLM representations are organized linearly (Elhage et al., 2022). Moreover, recent work by Gurnee & Tegmark (2024) provides evidence for the linear representation hypothesis within a transformer's hidden states (residual stream). This hypothesis justifies the use of inner products, such as cosine similarity, directly in the latent space; in addition, Park et al. (2024) have proposed alternatives like the causal inner product.
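Because cosine similarity is meaningful under this hypothesis, one simple way to match features between two SAEs is to compare their decoder directions. The sketch below is an assumption-laden illustration rather than the exact procedure used in this analysis: the function `match_features`, the toy decoder rows, and the threshold of 0.7 are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def match_features(dec_a, dec_b, threshold=0.7):
    """For each decoder row in dec_a, find its most similar row in dec_b.

    Returns (index_a, index_b, similarity) for pairs at or above the
    threshold. The threshold value is a hypothetical choice.
    """
    matches = []
    for i, u in enumerate(dec_a):
        sims = [cosine(u, v) for v in dec_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            matches.append((i, j, sims[j]))
    return matches

# Toy decoder matrices: each row is one feature direction.
dec_a = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
dec_b = [[0.0, 2.0], [3.0, 0.1], [-1.0, 1.0]]

matches = match_features(dec_a, dec_b)
matched_ratio = len(matches) / len(dec_a)
print([(i, j) for i, j, _ in matches], round(matched_ratio, 3))
```

Cosine similarity ignores the norm of each decoder row, so rescaled copies of the same direction still match. Here two of the three features in `dec_a` find a counterpart in `dec_b`, and the third falls below the threshold.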
Method
Why does the activation distribution differ for the same index? Are only the last 400 being used?
Weekly plan
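- Tomorrow: enjoy Valentine's Day (the two reserved places and beef Wellington for dinner) → read the scaling paper
- Weekend: probably skip the meeting, write the SAE activation post for LessWrong, and publish it within the weekend
- Saturday: go home with Min and study together? Read the scaling paper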
- two part
- gpt2 huggingface - SAE figure
- gpt2 batchtopk - SAE matrix figure
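- Note: neither of the two tests applied the geometric mean, so the ratio may rise once it is applied
- Suspicion: within the same SAE, starting from top-2, many features may overlap
Try running with the positional encoding removed
L1/L2 loss investigation
Check whether tutorial 2.0 has anything different; read papers
Re-render the loss plots
CE difference of the LLM after reconstruction
SAE loss / CE loss when the position embedding is removed
Feature UMAP
Final graphs
Two layerwise similarity plots (UMAP visualization animation)
Write without (who, 2024)-style citations → use footnotes for my own posts and plain links for everything else; fix the title and dataset
Re-render with the font size doubled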
figma
residual stream visualization
common feature matching visualization
Decoder weight UMAP, t-SNE - geometric mean initialization
Compare across different dictionary sizes (larger or smaller)
Monday: settle the corrsteer direction and share the report with zekun
- Tuesday: spend the day on the IR coursework
Wednesday: work on SAE feature RL
- Thursday: work on SNLP before Friday's meeting
Read the crosscoder paper
- Friday to Sunday: upload nnet to LessWrong (Π-Net, TreeSAE n-Net)
- Could the encoder and decoder be forced to share the same symmetry?
Nnet (with residual or not), synsae training → after completion, eluther embedding
Seonglae Cho