
Finally 2: properly write up the Visualized Analysis for LessWrong

Date: 2025 Feb 20 00:00 → 2025 Feb 21 00:00
Created by: Seonglae Cho
Created time: 2025 Feb 20 16:08
Last edited by: Seonglae Cho
Last edited time: 2025 Feb 20 16:39

Visualized Analysis about Residual SAEs for activation value and feature matching

Abstract

Sparse Autoencoders (SAEs) linearly disentangle interpretable features from a large language model's intermediate representations. However, the basic dynamics of SAEs, such as the activation values of SAE features and the encoder and decoder weights, have not been visualized as extensively as their implications. To shed light on the properties of feature activation values and the emergence of SAE features, I conducted a two-part visual analysis: (1) an analysis of SAE feature activations across token positions in comparison with other layers, and (2) a feature matching analysis across different SAEs based on decoder weights under diverse training settings. The first analysis revealed intriguing traits related to token positions and positional embeddings. The second analysis first identified differences between encoder and decoder weights in feature matching, and then examined the relative importance of factors such as the dataset, seed, SAE type, and dictionary size, all of which contribute to distinctive features across layers.

Introduction

The Sparse Autoencoder (SAE) architecture, introduced by Faruqui et al., has demonstrated the capacity to decompose interpretable features in a linear fashion (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023). SAE latent dimensions can be interpreted as monosemantic features by disentangling superpositioned neuron activations from the LLM's linear activations. This approach offers broad interpretability by reconstructing transformer residual streams (Gao et al., 2024), MLP activations (Bricken et al., 2023), and even dense word embeddings (O'Neill et al., 2024). Following its demonstrated efficiency, further exploration has uncovered novel architectures incorporating various activation functions (such as Top-K) and proposals for multi-level feature SAEs, including the Matryoshka SAE (Nabeshima, 2024; Bussmann, 2024).
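
To make the encode/decode structure concrete, here is a minimal sketch of a Top-K SAE forward pass in plain Python. It assumes one common formulation (ReLU pre-activations, keep the k largest, linear decoder with one row per feature); the names `encode_topk`, `decode`, `w_enc`, `w_dec`, `b_enc`, `b_dec` and the toy weights are illustrative, not taken from the SAEs analyzed in this post.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def encode_topk(x: Vector, w_enc: Matrix, b_enc: Vector, k: int) -> Vector:
    """Sparse feature activations: ReLU pre-activations with only the k largest kept."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    acts = [max(0.0, p) for p in pre]
    # Indices of the k largest activations; every other feature is zeroed out.
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the input as a linear combination of decoder rows (one row per feature)."""
    return [
        b_dec[j] + sum(f[i] * w_dec[i][j] for i in range(len(f)))
        for j in range(len(b_dec))
    ]


# Toy example (assumed values): a 2-dimensional activation and 3 dictionary features.
x = [1.0, 2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

features = encode_topk(x, w_enc, b_enc, k=1)
reconstruction = decode(features, w_dec, b_dec)
print(features)        # [0.0, 0.0, 3.0]
print(reconstruction)  # [1.5, 1.5]
```

With k=1 only the strongest feature survives, so the reconstruction is a scaled copy of a single decoder row. This is why decoder rows are often read as feature directions.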

Preliminaries

1.1 Mechanistic Interpretability

Mechanistic Interpretability seeks to reverse-engineer neural networks by analyzing their internal mechanisms and intermediate representations (Neel, 2021; Olah, 2022). This approach typically focuses on analyzing latent dimensions, leading to discoveries such as layer pattern features in CNN-based vision models (Olah et al., 2017; Carter et al., 2019) and neuron-level features (Schubert et al., 2021; Goh et al., 2021). The success of the attention mechanism (Bahdanau et al., 2014; Parikh et al., 2016) and the Transformer model (Vaswani et al., 2017) has further spurred efforts to understand the emergent abilities of transformers (Wei, 2022).

1.2 Residual Stream

In transformer architectures, the residual stream—as described in Elhage et al.—is a continuous flow of fixed-dimensional vectors connected via residual connections. It serves as a communication channel between layers and attention heads (Elhage et al., 2021), making it a focal point of research on transformer capabilities (Olsson et al., 2022; Riggs, 2023).
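
As a rough sketch of this view, the update within one layer can be written as two additive writes into the stream. The symbols $x_\ell$, $\mathrm{Attn}_\ell$, and $\mathrm{MLP}_\ell$ are notation assumed here for illustration, and layer normalization is omitted for brevity:

```latex
\begin{aligned}
x_\ell^{\text{mid}} &= x_\ell + \mathrm{Attn}_\ell\!\left(x_\ell\right), \\
x_{\ell+1} &= x_\ell^{\text{mid}} + \mathrm{MLP}_\ell\!\left(x_\ell^{\text{mid}}\right).
\end{aligned}
```

Because every block only adds to the stream, $x_\ell$ keeps the same dimensionality at every layer. A residual SAE trained at layer $\ell$ takes $x_\ell$ as its input.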

1.3 Superposition Hypothesis

In neural network representations, the superposition of thought vectors (Goh, 2023) and word embeddings (Arora et al., 2018) has given rise to the superposition hypothesis. Using toy models, Elhage et al. detailed the emergence of the superposition hypothesis through the process of phase change in feature dimensionality, linking it to compressed sensing (Donoho, 2006; Bora et al., 2017). Additionally, activations in transformers are empirically found to be highly superpositioned (Gurnee et al., 2023). While this superposition effectively explains the operation of LLMs, its linearity remains a controversial topic (Mendel, 2024).

1.4 Linear Representation Hypothesis

In the vector representation space of neural networks, it is posited that neural networks exhibit linear directions in activation space (Mikolov et al., 2013). This has led to studies demonstrating that word embeddings reside in interpretable linear subspaces (Park et al., 2023) and that LLM representations are organized linearly (Elhage et al., 2022). Moreover, recent work by Gurnee & Tegmark (2024) provides evidence for the linear representation hypothesis within a transformer's hidden states (residual stream). This hypothesis justifies the use of inner products, such as cosine similarity, directly in the latent space; in addition, Park et al. (2024) have proposed alternatives such as the causal inner product.
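
As an illustration of how cosine similarity can be used to compare feature directions, the sketch below matches each decoder row of one SAE to its most similar decoder row in another. This is one simple matching rule under stated assumptions, not necessarily the exact procedure used in this analysis; the function names, the toy weights, and the 0.9 threshold are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_features(dec_a: Matrix, dec_b: Matrix, threshold: float) -> List[Tuple[int, int, float]]:
    """For each feature in SAE A, find its best match in SAE B by cosine similarity.

    Returns (index_in_a, index_in_b, similarity) for pairs at or above the threshold.
    """
    matches = []
    for i, u in enumerate(dec_a):
        best_j, best_sim = max(
            ((j, cosine(u, v)) for j, v in enumerate(dec_b)),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            matches.append((i, best_j, best_sim))
    return matches


# Toy decoder weights (assumed values): rows are feature directions.
dec_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_b = [[0.0, 2.0], [3.0, 0.0]]

matches = match_features(dec_a, dec_b, threshold=0.9)
match_ratio = len(matches) / len(dec_a)
print([(i, j) for i, j, _ in matches])  # [(0, 1), (1, 0)]
```

Cosine similarity ignores the norm of each decoder row, so two features count as matched when they point in the same direction even if the SAEs scale them differently. The diagonal direction in `dec_a` has no close counterpart in `dec_b` and stays unmatched at this threshold.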

Method

 
 
 
 
Open question: why does the activation distribution differ for the same index? Is it because 400 from the end are used?

Weekly plan

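  • Tomorrow: enjoy Valentine's Day (the two places booked, plus beef Wellington for dinner) → read the scaling paper
  • Weekend: probably skip the meeting, write the SAE activation post for LessWrong, and publish it by the end of the weekend
    • Min, Saturday: go home and study together? Read the scaling paper
    • Run it with the positional encoding removed
      L1 and L2 loss investigation
      Check whether tutorial 2.0 has anything different; read the paper
      Re-render the loss
      CE difference of the LLM after reconstruction
      SAE loss / CE loss with the position embedding removed
      Feature UMAP
      Final graph
      Layerwise similarity: there are two (UMAP visualization animation)
      Write without (who, 2024)-style citations → footnotes for my own posts, plain links for links; fix the title and dataset
      Double the font size and re-render
      Figma
      Residual stream visualization
      Common feature matching visualization
    • Two parts
      • gpt2 Hugging Face - SAE figure
      • gpt2 BatchTopK - SAE matrix figure
        • Note: neither of the two tests applied the geometric mean, so applying it might raise the ratio
        • Suspicion: for the same SAE, starting from top-2 may produce a lot of overlap
        • Decoder weight UMAP, t-SNE - geometric mean initialization
          Compare across different dictionary sizes (larger or smaller)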
Monday: settle the corrsteer direction and share the report with Zekun
  • First, work out the inquiries and requests for the 1st report
  • Meanwhile, pick out and add related graphs (activations, etc.)
  • SAE-TS, to reconstruct feature, and BiDPO are suited to classification datasets
  • However, deprioritize dataset diversity to a single dataset; model diversity or
  • For the report, use quantile-based last-k instead of all-token steering
  • Tuesday: spend the day on IR coursework
Wednesday: work on SAE feature RL
  • Thursday: work on SNLP before Friday's meeting
    • Read the crosscoder paper
  • Fri-Sun: upload nnet to LessWrong
    Π-Net, TreeSAE n-Net
    • Could the encoder and decoder be forced to the same symmetry?
    • Nnet (with residual or not), synsae training → after completion, Eleuther embedding
 
 
 
 
 
 
