Finally: properly write up the 2 Visualized analysis LessWrong post

Visualized Analysis of Residual SAEs: Activation Values and Feature Matching

Abstract

Sparse Autoencoders (SAEs) linearly disentangle interpretable features from a large language model's intermediate representations. However, the basic dynamics of SAEs—such as the activation values of SAE features and the encoder and decoder weights—have not been as extensively visualized as their implications. To shed light on the properties of feature activation values and the emergence of SAE features, I conducted a two-part visual analysis: (1) an analysis of SAE feature activations across token positions in comparison with other layers, and (2) a feature matching analysis across different SAEs based on decoder weights under diverse training settings. The first analysis revealed intriguing traits related to token positions and positional embeddings. The second analysis initially identified differences between encoder and decoder weights in feature matching, and examined the relative importance of factors such as the dataset, seed, SAE type, and dictionary size, all of which contribute to distinctive features across layers.

Introduction

The Sparse Autoencoder (SAE) architecture, introduced by Faruqui et al., has demonstrated the capacity to decompose model representations into interpretable features in a linear fashion (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023). SAE latent dimensions can be interpreted as monosemantic features obtained by disentangling the superposed neuron activations in the LLM's activations. This approach offers broad interpretability by reconstructing transformer residual streams (Gao et al., 2024), MLP activations (Bricken et al., 2023), and even dense word embeddings (O'Neill et al., 2024). Following this demonstrated effectiveness, subsequent work has proposed new architectures with different activation functions (such as Top-K) and multi-level feature SAEs, including the Matryoshka SAE (Nabeshima, 2024; Bussmann, 2024).
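To make the encoder/decoder structure concrete, here is a minimal sketch of a Top-K SAE forward pass on a single activation vector, written in plain Python. The shapes, the random weights, the subtraction of the decoder bias before encoding, and the name topk_sae_forward are all illustrative assumptions; this is not the training code behind the analyses in this post.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Return (sparse latent activations, reconstruction) for one vector x."""
    # Encoder: affine map of the centered input, followed by ReLU.
    # Centering by b_dec is an assumed convention, not taken from this post.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    acts = [max(0.0, p) for p in pre]
    # Top-K sparsity: keep the k largest activations and zero out the rest.
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    z = [a if i in keep else 0.0 for i, a in enumerate(acts)]
    # Decoder: the reconstruction is a sparse linear combination of decoder rows.
    x_hat = list(b_dec)
    for i, zi in enumerate(z):
        if zi != 0.0:
            x_hat = [h + zi * w for h, w in zip(x_hat, W_dec[i])]
    return z, x_hat

# Toy sizes and random weights (hypothetical; a real SAE is trained).
rng = random.Random(0)
d_model, d_sae, k = 4, 16, 3
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
W_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model
x = [rng.gauss(0, 1) for _ in range(d_model)]

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of W_dec plays the role of one feature direction, which is why decoder rows are the natural objects to compare when matching features across SAEs.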

…

Preliminaries

1.1 Mechanistic Interpretability

Mechanistic Interpretability seeks to reverse-engineer neural networks by analyzing their internal mechanisms and intermediate representations (Neel, 2021; Olah, 2022). This approach typically focuses on analyzing latent dimensions, leading to discoveries such as layer pattern features in CNN-based vision models (Olah et al., 2017; Carter et al., 2019) and neuron-level features (Schubert et al., 2021; Goh et al., 2021). The success of the attention mechanism (Bahdanau et al., 2014; Parikh et al., 2016) and the Transformer model (Vaswani et al., 2017) has further spurred efforts to understand the emergent abilities of transformers (Wei, 2022).

1.2 Residual Stream

In transformer architectures, the residual stream—as described in Elhage et al.—is a continuous flow of fixed-dimensional vectors connected via residual connections. It serves as a communication channel between layers and attention heads (Elhage et al., 2021), making it a focal point of research on transformer capabilities (Olsson et al., 2022; Riggs, 2023).

1.3 Superposition Hypothesis

In neural network representations, the superposition of thought vectors (Goh, 2023) and word embeddings (Arora et al., 2018) has given rise to the superposition hypothesis. Using toy models, Elhage et al. detailed the emergence of the superposition hypothesis through the process of phase change in feature dimensionality, linking it to compressed sensing (Donoho, 2006; Bora et al., 2017). Additionally, activations in transformers are empirically found to be highly superpositioned (Gurnee et al., 2023). While this superposition effectively explains the operation of LLMs, its linearity remains a controversial topic (Mendel, 2024).
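As a toy illustration of superposition (my own sketch, not an experiment from the cited work), the snippet below packs more feature directions than there are dimensions, activates a single feature, and reads every feature back with a dot product. The sizes and the use of random Gaussian directions are assumptions chosen only for illustration.

```python
import math
import random

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
n_features, d_model = 64, 16  # more features than dimensions

# Random unit directions: they cannot all be orthogonal when n_features > d_model.
directions = [
    unit([rng.gauss(0, 1) for _ in range(d_model)]) for _ in range(n_features)
]

# Activate a single feature with value 1.0 and write it into the hidden space.
active = 5
hidden = [1.0 * x for x in directions[active]]

# Linear readout: project the hidden vector onto every feature direction.
readout = [dot(hidden, d) for d in directions]

# Interference: the largest spurious readout on any inactive feature.
interference = max(abs(r) for i, r in enumerate(readout) if i != active)
```

The active feature is read back at full strength, while every other feature picks up a smaller nonzero value. That residual interference is the cost of storing more features than dimensions, and it stays tolerable only while few features are active at once.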

1.4 Linear Representation Hypothesis

In vector representation spaces, neural networks are posited to exhibit linear directions in activation space (Mikolov et al., 2013). This has led to studies demonstrating that word embeddings reside in interpretable linear subspaces (Park et al., 2023) and that LLM representations are organized linearly (Elhage et al., 2022). Moreover, Gurnee & Tegmark (2024) provide evidence for the linear representation hypothesis within a transformer's hidden states (residual stream). This hypothesis justifies using inner products, such as cosine similarity, directly in the latent space; in addition, Park et al. (2024) have proposed alternatives such as the causal inner product.
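Since cosine similarity in the latent space is what underlies decoder-based feature matching, here is a minimal sketch of matching features between two SAEs by the cosine similarity of their decoder rows. The function name match_features, the 0.7 threshold, and the tiny hand-written matrices are hypothetical choices for illustration; the matching procedure actually used for the figures may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_features(dec_a, dec_b, threshold=0.7):
    """For each decoder row in dec_a, find its most similar row in dec_b.

    Returns (index_a, index_b, similarity) triples whose best similarity
    reaches the threshold. The threshold value is an assumption.
    """
    matches = []
    for i, row_a in enumerate(dec_a):
        sims = [cosine(row_a, row_b) for row_b in dec_b]
        j = max(range(len(sims)), key=lambda idx: sims[idx])
        if sims[j] >= threshold:
            matches.append((i, j, sims[j]))
    return matches

# Two tiny hand-written "decoder matrices": one row per feature direction.
dec_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dec_b = [[0.0, 2.0, 0.1], [0.9, 0.1, 0.0], [1.0, 1.0, 1.0]]

matches = match_features(dec_a, dec_b)
matched_pairs = [(i, j) for i, j, _ in matches]  # [(0, 1), (1, 0)]
```

Here the third feature of dec_a has no counterpart above the threshold, so it is left unmatched. The fraction of matched features is the kind of quantity one can compare across seeds, datasets, SAE types, and dictionary sizes.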

Method

Why does the activation distribution differ for the same index? Is it because the last 400 are used?

Weekly plan

Tomorrow: enjoy Valentine's Day (the two places I reserved, and beef Wellington for dinner) → read the scaling paper

Weekend: probably skip the meeting, write the SAE activation post for LessWrong, and publish it before the weekend ends

Saturday: go home with Min and study together? Read the scaling paper

Remove the positional encoding and rerun

L1 / L2 loss investigation

Check whether tutorial 2.0 has anything different; read papers

Re-render the loss plots

CE difference of the LLM after reconstruction

SAE loss / CE loss when the position embedding is removed

Feature UMAP

Final graphs

There are two layerwise similarity figures (UMAP visualization animation)

Write without "(who, 2024)"-style citations → use footnotes for my own notes and plain links for links; fix the title and dataset

Double the font size and re-render

Figma

residual stream visualization

common feature matching visualization

two part

gpt2 huggingface - SAE figure
gpt2 batchtopk - SAE matrix figure

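However, neither of the two tests applied the geometric mean, so the ratio may go up once it is applied
Suspicion: because matching within the same SAE starts from top-2, there may be a lot of overlap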

Decoder weight UMAP, t-SNE - geometric mean initialization

Compare across different dictionary sizes: larger or smaller

Monday: settle the direction for corrsteer and share the report with Zekun

First, work out what is being requested for the 1st report

Meanwhile, select the graphs related to activations and so on, and add them

SAE-TS (to reconstruct features) and
BiDPO are suited to classification datasets

However, lower the priority of dataset diversity (keep it to one dataset), or of model diversity

For the report, use quantile-based last-k rather than all-token steering

Tuesday: spend the day on the IR coursework

Wednesday: work on SAE feature RL

Thursday: work on SNLP before Friday's meeting

Read the crosscoder paper

Fri-Sun: upload the nnet post to LessWrong
Π-Net, TreeSAE n-Net

Can't we force the encoder and decoder to share the same symmetry?

Nnet (with or without residual), synsae training → once done, eluther embedding