Π-Net, TreeSAE n-Net

Creator
Seonglae Cho
Created
2024 Dec 17 18:59
Edited
2025 Dec 23 23:34
Sparse Autoencoders (SAEs) have proven able to generate interpretable features by decomposing the residual stream into sparse latent dimensions. A wide range of features has been found, from single-token features to context features such as base64 or Arabic features. As previous research has shown, feature splitting depends heavily on the SAE's dictionary size, and context features tend to fragment as the dictionary grows. Starting from the natural idea that a hierarchical architecture could capture hierarchically splitting features through multi-level encoding, I compared each layer's features and training statistics against a baseline top-k SAE. I also note this architecture's limitations explicitly and compare it with similar approaches such as Matryoshka SAE.
I tested three variants and analyzed why one of the designs outperforms the others yet still falls behind the SoTA top-k and BatchTopK SAEs.
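
For reference, the baseline mentioned above can be sketched roughly as follows. This is a minimal NumPy sketch of a generic single-layer top-k SAE forward pass, not the exact baseline used in these experiments; the sizes, the tied initialisation, and the function names `topk_activation` and `topk_sae_forward` are assumptions made for illustration.

```python
import numpy as np

def topk_activation(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations per row and zero the rest."""
    # Indices of the k largest entries in each row (their order is arbitrary).
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    np.put_along_axis(out, idx, np.take_along_axis(pre_acts, idx, axis=-1), axis=-1)
    # ReLU so that surviving latents are non-negative.
    return np.maximum(out, 0.0)

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Single-layer top-k SAE: returns (latents, reconstruction)."""
    z = topk_activation((x - b_dec) @ W_enc + b_enc, k)
    return z, z @ W_dec + b_dec

rng = np.random.default_rng(0)
d_model, d_dict, k = 16, 64, 4            # hypothetical sizes
x = rng.normal(size=(8, d_model))         # stand-in for residual-stream activations
W_enc = rng.normal(size=(d_model, d_dict)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()                    # tied initialisation (assumption)
z, x_hat = topk_sae_forward(x, W_enc, np.zeros(d_dict), W_dec, np.zeros(d_model), k)
```

Each row of `z` has at most `k` non-zero entries, so sparsity is fixed by construction rather than tuned through a penalty coefficient.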

SAE Dead Feature
The amount of dead features is tiny.
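
A rough way to check this, assuming "dead" means a feature that never fires over the evaluation tokens (the threshold `eps` and the function name are assumptions, not the definition used in the experiments):

```python
import numpy as np

def dead_feature_fraction(latents: np.ndarray, eps: float = 0.0) -> float:
    """Fraction of dictionary features that never exceed eps over all tokens."""
    fired = (latents > eps).any(axis=0)    # one flag per feature
    return float(1.0 - fired.mean())

# Toy latents: 3 tokens x 4 features; features 2 and 3 never fire.
latents = np.array([[0.0, 1.2, 0.0, 0.0],
                    [0.3, 0.0, 0.0, 0.0],
                    [0.0, 0.7, 0.0, 0.0]])
print(dead_feature_fraction(latents))  # 0.5
```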

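  • Apply a different activation per layer (hierarchical)
Good idea: a hierarchical SAE whose width gradually increases and then gradually decreases, with a sparsity regularizer applied to every layer so that both high-level and low-level features can emerge. Gemma or GPT-2 would work, and a transcoder would be nice. (Problem statement: there are too many kinds of sparse autoencoders. Another problem is that monosemanticity is so strong that high-level context cannot be caught.)
Finding high-level features through circuits that combine monosemantic features will be the key going forward. Should my paper focus on performance improvement, or on jailbreaks?
When steering, instead of using a steering vector, it would be better to use the circuit by adding to the corresponding activations inside the SAE.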
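
The steering idea can be sketched as below. This is a minimal sketch under assumptions: a plain ReLU encoder, and hypothetical names (`steer_with_sae_features`, `feature_ids`, `strength`) that are not from any library.

```python
import numpy as np

def steer_with_sae_features(x, W_enc, b_enc, W_dec, feature_ids, strength):
    """Add `strength` to selected SAE latents and return the steered activation.

    Only the decoded change is added back to x, so the SAE's
    reconstruction error on x is left untouched.
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)     # ReLU encoder (assumed)
    z_steered = z.copy()
    z_steered[..., feature_ids] += strength    # boost the chosen circuit features
    delta = (z_steered - z) @ W_dec            # decoded effect of the edit
    return x + delta

rng = np.random.default_rng(0)
d_model, d_dict = 8, 32                        # hypothetical sizes
W_enc = rng.normal(size=(d_model, d_dict))
W_dec = rng.normal(size=(d_dict, d_model))
x = rng.normal(size=(2, d_model))
x_steered = steer_with_sae_features(x, W_enc, np.zeros(d_dict), W_dec, [1, 3], 2.0)
```

With a purely additive edit like this one, the result equals adding `strength` times the sum of the chosen decoder rows, so the practical difference from a steering vector lies in selecting the features through a circuit. Clamping or scaling the latents instead would make the edit depend on the input.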

Important design choices

  • Whether to expand first and then shrink (this might be better?)
    • Sparsity scaling factor per 2× of expansion factor, multiplied into the sparsity coefficient
  • Whether to expand gradually
  • Layer normalization between output and input
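
The two width schedules and the per-level sparsity coefficient could look like this. It is a sketch under one reading of the note: each doubling of the expansion factor multiplies the base sparsity coefficient by a fixed factor. All function and parameter names are hypothetical.

```python
import math

def gradual_expansion(peak_expansion: int) -> list[int]:
    """Expansion factors that only grow: 2, 4, ..., peak."""
    factors, f = [], 2
    while f <= peak_expansion:
        factors.append(f)
        f *= 2
    return factors

def expand_then_shrink(peak_expansion: int) -> list[int]:
    """Expansion factors that double up to the peak, then halve again."""
    up = gradual_expansion(peak_expansion)
    return up + up[-2::-1]

def sparsity_coefficients(expansions, base_coeff, scale_per_doubling):
    """Multiply base_coeff by scale_per_doubling once per 2x of expansion."""
    return [base_coeff * scale_per_doubling ** math.log2(e) for e in expansions]

schedule = expand_then_shrink(8)                       # [2, 4, 8, 4, 2]
coeffs = sparsity_coefficients(schedule, 1e-3, 2.0)    # wider levels get a larger penalty
```

Whether the factor should be above or below 1 (stronger or weaker sparsity on wider levels) is left open here.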

If there are too many parameters, use a Bottleneck layer to make the weights low rank, where the rank is the original dimension.
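
A quick parameter count shows why this helps. The sketch assumes a transition between two wide levels (8× and 16× expansion) and `d_model = 768` as in GPT-2 small; biases are ignored and the helper names are hypothetical.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a full d_in x d_out matrix."""
    return d_in * d_out

def bottleneck_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights when the matrix is factorised as (d_in x rank) @ (rank x d_out)."""
    return d_in * rank + rank * d_out

d_model = 768                              # GPT-2 small residual width
d_in, d_out = 8 * d_model, 16 * d_model    # two adjacent hierarchy levels (assumed)

dense = dense_params(d_in, d_out)
low_rank = bottleneck_params(d_in, d_out, rank=d_model)
ratio = low_rank / dense                   # fraction of the dense parameter count
```

In this setting the factorised transition uses under a fifth of the dense parameters, and the bottleneck activations live in the same space size as the residual stream.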

  1. Hierarchical SAE
  • Problem Statement: A single SAE cannot capture high-level and low-level features simultaneously
  • Proposal: Design a multi-layer hierarchical SAE by applying sparsity loss to each layer’s latent
  • Method
    • Ensuring low correlation between features of different levels of hierarchy to verify distinctive features are discovered through hierarchical SAE (or through correlation loss?)
    • A single-layer decoder that concatenates all feature dimensions with layer normalization?
  • Benchmarks
    • SAEBench focusing on sparsity and reconstruction loss
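
The proposal above can be sketched as follows. This is only one possible reading: ReLU latents at every level, layer normalization on each level's input, an L1 sparsity term per level, and a single decoder over the concatenated latents. It is not one of the three tested variants, and all names are hypothetical.

```python
import numpy as np

def init_hierarchical_sae(d_model, widths, rng):
    """One encoder per level plus a single decoder over all concatenated latents."""
    dims = [d_model] + list(widths)
    encoders = [rng.normal(size=(dims[i], dims[i + 1])) / np.sqrt(dims[i])
                for i in range(len(widths))]
    decoder = rng.normal(size=(sum(widths), d_model)) / np.sqrt(sum(widths))
    return encoders, decoder

def layer_norm(h, eps=1e-5):
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def forward(x, encoders, decoder):
    """Encode level by level, then decode from the concatenation of all levels."""
    latents, h = [], x
    for W in encoders:
        z = np.maximum(layer_norm(h) @ W, 0.0)   # ReLU latent for this level
        latents.append(z)
        h = z                                    # the next level reads this latent
    x_hat = np.concatenate(latents, axis=-1) @ decoder
    return latents, x_hat

def loss(x, latents, x_hat, coeffs):
    """Reconstruction MSE plus an L1 sparsity penalty on every level's latent."""
    recon = np.mean((x - x_hat) ** 2)
    sparsity = sum(c * np.abs(z).sum(-1).mean() for c, z in zip(coeffs, latents))
    return recon + sparsity

def cross_level_correlation(z_a, z_b, eps=1e-8):
    """Mean absolute Pearson correlation between features of two levels."""
    a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    return float(np.abs(a.T @ b / len(z_a)).mean())

rng = np.random.default_rng(0)
d_model, widths = 16, [32, 64, 32]               # hypothetical sizes
encoders, decoder = init_hierarchical_sae(d_model, widths, rng)
x = rng.normal(size=(128, d_model))
latents, x_hat = forward(x, encoders, decoder)
total = loss(x, latents, x_hat, coeffs=[1e-3, 2e-3, 1e-3])
corr = cross_level_correlation(latents[0], latents[1])
```

`cross_level_correlation` can serve as an evaluation metric, or be added to the loss if a correlation penalty turns out to be needed.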

Switch SAE

How about simply averaging SAEs with multiple expansion factors, each with a different coefficient?
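
A minimal sketch of that idea, reading "coefficient" as the mixing weight of each SAE (it could equally mean a per-SAE sparsity coefficient). The ReLU SAEs and all names here are assumptions.

```python
import numpy as np

def relu_sae(x, W_enc, W_dec):
    """Reconstruction from a plain ReLU SAE without biases."""
    return np.maximum(x @ W_enc, 0.0) @ W_dec

def averaged_reconstruction(x, saes, weights):
    """Weighted average of the reconstructions of several SAEs."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                    # normalise mixing weights
    recons = np.stack([relu_sae(x, W_enc, W_dec) for W_enc, W_dec in saes])
    return np.tensordot(w, recons, axes=1)             # shape: (batch, d_model)

rng = np.random.default_rng(0)
d_model = 16
saes = []
for expansion in (2, 4, 8):                            # hypothetical expansion factors
    d_dict = expansion * d_model
    W_enc = rng.normal(size=(d_model, d_dict)) / np.sqrt(d_model)
    saes.append((W_enc, W_enc.T / d_dict))
x = rng.normal(size=(4, d_model))
x_hat = averaged_reconstruction(x, saes, weights=[0.5, 0.3, 0.2])
```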
 
 
 
 
 
 
