Π-Net, TreeSAE n-Net

Sparse autoencoders (SAEs) have proven able to produce interpretable features by decomposing the residual stream into sparse latent dimensions. A wide range of features has been found, from single-token features to context features such as base64 or Arabic features. As previous research has shown, feature splitting depends heavily on the SAE's dictionary size, and context features tend to fragment as the dictionary grows. Starting from the natural idea that a hierarchical architecture could capture hierarchically splitting features through multi-level encoding, I tested three variants. I compare each layer's features and training statistics against a baseline Top-K SAE. I also state this architecture's limitations explicitly and compare it with similar approaches such as Matryoshka SAE.

I tested three variants and analyzed why one of the designs outperforms the others yet still falls behind the SoTA Top-K and BatchTopK SAEs.

The number of dead SAE features is very small

Apply a different activation per layer (hierarchical)

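Good idea: a hierarchical SAE whose width gradually increases and then gradually decreases, with a sparsity regularizer applied to every layer so that both high-level and low-level features can be captured. Target Gemma or GPT-2; a transcoder would be a good fit. (Problems raised: there are too many different sparse autoencoder variants. Another problem is that monosemanticity is so strong that high-level context is not captured.)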

Finding high-level features through circuits that combine monosemantic features will be key going forward. Should my paper focus on performance improvement, or on jailbreaks?

When steering, instead of using a steering vector, it would be better to use the circuit by adding the corresponding activation inside the SAE.

Important design choices

Whether to expand first and then shrink (this might be better?)

A sparsity scaling factor per 2x increase in expansion factor, multiplied into the sparsity coefficient

Or whether to increase the width gradually
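
The two width schedules and the per-doubling sparsity scaling noted above can be sketched as follows. This is only an illustration under assumptions: the function names, the choice of doubling the width at each level, and the rule "multiply the coefficient by a fixed factor for each 2x of expansion" are one reading of these notes, not a settled design.

```python
import math

def gradual_increase(d_model, n_levels):
    """Widths that double at every level: 2d, 4d, 8d, ..."""
    return [d_model * 2 ** (i + 1) for i in range(n_levels)]

def expand_then_shrink(d_model, n_levels):
    """Widths that double up to a peak and then mirror back down."""
    half = (n_levels + 1) // 2
    up = [d_model * 2 ** (i + 1) for i in range(half)]
    down = up[: n_levels - half][::-1]
    return up + down

def sparsity_coeffs(widths, d_model, base_coeff, scale_per_doubling):
    """Scale the sparsity coefficient once per 2x of expansion factor.

    The expansion factor of a level is width / d_model (an assumption).
    """
    return [
        base_coeff * scale_per_doubling ** math.log2(w / d_model)
        for w in widths
    ]

widths = expand_then_shrink(768, 3)               # [1536, 3072, 1536]
coeffs = sparsity_coeffs(widths, 768, 1e-3, 2.0)  # larger levels get a larger penalty
```

Whether the coefficient should grow or shrink with width is itself a design choice; a scale_per_doubling below 1.0 gives the opposite behavior.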

Dimension normalization to prevent attention sinks

layer normalization between output and input

If there are too many parameters:
use a bottleneck layer as a low-rank factorization, where the low rank is the original dimension

Hierarchical SAE

Problem Statement: A single SAE cannot capture high-level and low-level features simultaneously

Proposal: Design a multi-layer hierarchical SAE by applying sparsity loss to each layer’s latent
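
A minimal NumPy sketch of the proposal, forward pass only. Assumptions that are not fixed anywhere in these notes: ReLU encoders stacked one after another, an L1 penalty as the sparsity loss, and a single linear decoder reading from the last level. All names (init_hierarchical_sae, forward, l1_coeffs) are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_hierarchical_sae(d_model, widths, seed=0):
    """Create one encoder per level plus a decoder from the last level."""
    rng = np.random.default_rng(seed)
    params, d_in = [], d_model
    for width in widths:
        params.append({
            "W_enc": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, width)),
            "b_enc": np.zeros(width),
        })
        d_in = width
    W_dec = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, d_model))
    return params, W_dec

def forward(x, params, W_dec, l1_coeffs):
    """Return reconstruction, per-level latents, and the total loss."""
    latents, h = [], x
    for layer in params:
        h = relu(h @ layer["W_enc"] + layer["b_enc"])
        latents.append(h)
    x_hat = h @ W_dec
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    # Sparsity loss is applied to every level's latent, not only the last.
    sparsity = sum(
        c * np.mean(np.sum(np.abs(z), axis=-1))
        for c, z in zip(l1_coeffs, latents)
    )
    return x_hat, latents, recon + sparsity

params, W_dec = init_hierarchical_sae(d_model=16, widths=[32, 64, 32])
x = np.random.default_rng(1).normal(size=(5, 16))
x_hat, latents, loss = forward(x, params, W_dec, l1_coeffs=[1e-3, 1e-3, 1e-3])
```

Training would need gradients, so a real implementation would use an autodiff framework; the sketch only shows where each level's sparsity term enters the loss.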

Method

Ensuring low correlation between features of different levels of hierarchy to verify distinctive features are discovered through hierarchical SAE (or through correlation loss?)
A single-layer decoder that concatenates all feature dimensions with layer normalization?
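
One way to check the low-correlation requirement is to compute the Pearson correlation between every feature of one level and every feature of another over a batch of activations. The sketch below is an assumption about how that check could look; the squared-correlation penalty is likewise only one candidate for the "correlation loss" mentioned above, and both function names are hypothetical.

```python
import numpy as np

def cross_level_correlation(z_a, z_b, eps=1e-8):
    """Pearson correlation between features of two levels.

    z_a: (n_samples, n_a) latents of one level
    z_b: (n_samples, n_b) latents of another level
    Returns an (n_a, n_b) matrix; dead (constant) features give 0.
    """
    a = z_a - z_a.mean(axis=0)
    b = z_b - z_b.mean(axis=0)
    a = a / (np.linalg.norm(a, axis=0) + eps)
    b = b / (np.linalg.norm(b, axis=0) + eps)
    return a.T @ b

def correlation_penalty(z_a, z_b):
    """Mean squared cross-level correlation (candidate auxiliary loss)."""
    return float(np.mean(cross_level_correlation(z_a, z_b) ** 2))
```

For verification alone, inspecting the maximum absolute correlation per feature is probably enough; the penalty only matters if the correlation is to be optimized during training.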

Baseline

Pythia SAE from (Cunningham, 2023)
GPT2 Top-K SAE from (Gao, 2024)
Gemma2 Gated SAE from (Rajamanoharan, 2024)

Datasets

The Pile for training (which is most common)
Toy dataset with Neel Nanda's The Pile 10k

Benchmarks

SAEBench focusing on sparsity and reconstruction loss
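
For quick comparisons before running the full benchmark, the sparsity and reconstruction quantities can be computed directly. The definitions below are common ones (mean L0, MSE normalized by the variance of the input, fraction of features that never fire) and are written from scratch here; they are not the SAEBench API, and the function names are hypothetical.

```python
import numpy as np

def l0_sparsity(z):
    """Average number of nonzero latents per sample."""
    return float(np.mean(np.count_nonzero(z, axis=-1)))

def normalized_mse(x, x_hat):
    """Reconstruction error divided by the error of predicting the mean."""
    num = np.sum((x - x_hat) ** 2)
    den = np.sum((x - x.mean(axis=0)) ** 2)
    return float(num / den)

def dead_fraction(z):
    """Fraction of latents that are zero on every sample in the batch."""
    return float(np.mean(~np.any(z != 0, axis=0)))
```

A normalized MSE of 0 means perfect reconstruction and 1 means no better than always predicting the mean activation.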

Break down into parts

Reproducing SAE training

Improving and designing the architecture

Hierarchical Encoder

Dead neuron mitigation

Ghost gradient method
Auxiliary-k loss method
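
A simplified sketch of the auxiliary-k idea from the Top-K SAE work (Gao, 2024): the main reconstruction leaves a residual, and that residual is reconstructed using only the top k_aux latents among those currently flagged as dead, so dead latents still receive a training signal. How the dead mask is tracked, the loss normalization, and all names below are assumptions for illustration.

```python
import numpy as np

def topk_mask(pre_acts, k):
    """Keep the k largest entries in each row and zero out the rest."""
    idx = np.argpartition(pre_acts, -k, axis=-1)[:, -k:]
    out = np.zeros_like(pre_acts)
    vals = np.take_along_axis(pre_acts, idx, axis=-1)
    np.put_along_axis(out, idx, vals, axis=-1)
    return out

def auxk_loss(x, x_hat, pre_acts, W_dec, dead_mask, k_aux):
    """Reconstruct the residual x - x_hat from the top k_aux dead latents.

    dead_mask: boolean (n_latents,), True for latents considered dead,
    e.g. not active for some number of training steps (assumption).
    """
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return 0.0
    k_aux = min(k_aux, n_dead)
    # Exclude live latents from the top-k selection.
    masked = np.where(dead_mask, pre_acts, -np.inf)
    z_aux = np.maximum(topk_mask(masked, k_aux), 0.0)
    residual = x - x_hat
    residual_hat = z_aux @ W_dec
    return float(np.mean(np.sum((residual - residual_hat) ** 2, axis=-1)))
```

The result would be added to the main loss with a small weight. The ghost gradient method addresses the same problem differently and is not sketched here.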

Optimizing training time

Distributed training, as in the OpenAI Top-K SAE codebase
Bottleneck layer

Switch SAE

How about simply averaging SAEs with multiple expansion factors, each with a different coefficient?

Recommendations

If there are too many parameters:
use a bottleneck layer as a low-rank factorization, where the low rank is the original dimension
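
To make the saving concrete, the sketch below compares the parameter count of a dense map between two wide levels with a factorized map that passes through a bottleneck of rank equal to the original model dimension. The sizes (768, 8192, 32768) are illustrative assumptions, not values from any experiment here, and biases are ignored.

```python
def dense_params(d_in, d_out):
    """Weights in a single dense map d_in -> d_out."""
    return d_in * d_out

def bottleneck_params(d_in, d_out, rank):
    """Weights in a factorized map d_in -> rank -> d_out."""
    return d_in * rank + rank * d_out

d_model = 768                # original dimension, used as the rank
d_in, d_out = 8192, 32768    # two hypothetical adjacent level widths

dense = dense_params(d_in, d_out)                   # 268,435,456
low_rank = bottleneck_params(d_in, d_out, d_model)  # 31,457,280
print(f"dense={dense:,} low_rank={low_rank:,} ratio={dense / low_rank:.1f}x")
```

The factorized form only saves parameters when the rank is below d_in * d_out / (d_in + d_out), which holds comfortably when both levels are much wider than the model dimension.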