What is SAE
- EleutherAI: https://arxiv.org/abs/2309.08600

Research Areas for SAE
- Improving SAE model architecture
- Gated SAE (with its weight tying, equivalent to a JumpReLU activation) https://arxiv.org/pdf/2404.16014
- TopK SAE https://cdn.openai.com/papers/sparse-autoencoders.pdf
- BatchTopK SAE https://arxiv.org/pdf/2412.06410
- Hierarchical RQAE https://hkamath.me/blog/2024/rqae/
- Training SAE methods
- Transcoders (take the MLP input as the SAE input and reconstruct the output of the MLP layer, instead of reconstructing the same vector) https://arxiv.org/pdf/2406.11944
- Transfer Learning along Layers (not compatible) https://aclanthology.org/2024.blackboxnlp-1.32.pdf
- Layer-compatible Crosscoder https://transformer-circuits.pub/2024/crosscoders/index.html

- Extracting and interpreting SAE features automatically
- LLM as a Neuron Explainer https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
- VocabProj https://arxiv.org/pdf/2501.08319
- The activation reconstructed by the SAE decoder can be used as a steering vector
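To make the TopK and transcoder items above concrete, here is a minimal sketch in plain Python (no dependencies). It assumes untrained random weights, and every name in it (`topk_activation`, `encode`, `decode`, `mlp_in`, `mlp_out`) is hypothetical rather than taken from any of the linked papers or their code. The only point is that a TopK encoder keeps at most k active latents, and that an SAE and a transcoder can share the same forward pass while differing in the reconstruction target.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def topk_activation(pre, k):
    """Keep the k largest pre-activations (clipped at zero); zero the rest."""
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]

def encode(x, W_enc, b_enc, k):
    """Linear encoder followed by the TopK activation."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return topk_activation(pre, k)

def decode(z, W_dec, b_dec):
    """Linear decoder; W_dec has shape (d_model, n_latents)."""
    return [r + b for r, b in zip(matvec(W_dec, z), b_dec)]

def mse(a, b):
    """Mean squared error between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

# Toy sizes and random (untrained) weights: assumptions for illustration only.
random.seed(0)
d_model, n_latents, k = 4, 16, 3
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_latents)]
W_dec = [[random.gauss(0, 1) for _ in range(n_latents)] for _ in range(d_model)]
b_enc, b_dec = [0.0] * n_latents, [0.0] * d_model

mlp_in = [random.gauss(0, 1) for _ in range(d_model)]
mlp_out = [random.gauss(0, 1) for _ in range(d_model)]  # stand-in for a real MLP output

z = encode(mlp_in, W_enc, b_enc, k)
x_hat = decode(z, W_dec, b_dec)

sae_loss = mse(x_hat, mlp_in)          # SAE target: the same vector that was encoded
transcoder_loss = mse(x_hat, mlp_out)  # transcoder target: the MLP output
```

A real implementation would use a tensor library and train the weights. BatchTopK differs in selecting the top activations across a whole batch instead of per sample, which this sketch does not show.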
Limitations of SAE Research
- SAE cannot capture high-level context
- Training an SAE is expensive
- Training a separate SAE for each layer of each model is inefficient
Research Ideas
- Build a hierarchical SAE to capture high-level context, for example by expanding and shrinking the latent dimension in stages (256 → 512 → 1024 …)
- Multi-token SAE: feed the attention residual vectors of multiple tokens at the same time (the input becomes a sequence), unlike conventional SAE inference, where each token is processed independently
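One possible reading of the multi-token idea, sketched in plain Python: instead of giving the SAE one token vector at a time, concatenate each token's vector with those of its predecessors. The window size, the left zero-padding, and the function names are all assumptions made for illustration; the notes do not fix how the sequence input should be built.

```python
def per_token_inputs(residuals):
    """Conventional setup: each token vector is an independent SAE input."""
    return [list(vec) for vec in residuals]

def windowed_inputs(residuals, window):
    """Concatenate each token's vector with its (window - 1) predecessors.

    Positions before the start of the sequence are zero-padded, so the
    output has one input per token, each of length window * d_model.
    """
    d_model = len(residuals[0])
    padding = [[0.0] * d_model for _ in range(window - 1)]
    padded = padding + [list(vec) for vec in residuals]
    inputs = []
    for t in range(len(residuals)):
        chunk = padded[t:t + window]  # predecessors first, current token last
        inputs.append([v for vec in chunk for v in vec])
    return inputs

# Three tokens with d_model = 2 (toy values).
residuals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
single = per_token_inputs(residuals)
pairs = windowed_inputs(residuals, window=2)
```

With `window=2` the SAE input dimension doubles, so the encoder sees the previous token alongside the current one. Whether this actually captures higher-level context is the open question the idea raises.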
Additional Resources
- Super Weight https://arxiv.org/abs/2411.07191
- Refusal is a single direction https://arxiv.org/abs/2406.11717
- Adversarial training via refusal feature (Karen co-auth) https://arxiv.org/abs/2409.20089
SAE pipeline
- Choose LLM
- GPT2 ←
- Gemma 2 2B ← Gemma Scope
- Architect SAE
- Train SAE
- Interpret features automatically
- Select useful features manually
- Extract steering vectors and coefficient
- Inference and evaluate
- Design choices for applying steering
- Use the SAE at inference: fix the feature value and replace the activation with the reconstructed one
- Use only the steering vector: add the steering vector scaled by a coefficient
- Which tokens should the steering vector be applied to?
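The two ways of applying a feature listed above can be contrasted with a small plain-Python sketch. It assumes a ReLU encoder, a decoder matrix of shape (d_model, n_latents), and that the steering vector is the selected feature's decoder column; these are illustrative choices, and all function names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def steer_by_reconstruction(x, W_enc, b_enc, W_dec, b_dec, feature, value):
    """Option 1: encode, fix one feature's value, return the reconstruction."""
    z = [max(p + b, 0.0) for p, b in zip(matvec(W_enc, x), b_enc)]  # ReLU encoder
    z[feature] = value
    return [r + b for r, b in zip(matvec(W_dec, z), b_dec)]

def decoder_direction(W_dec, feature):
    """Column of W_dec that belongs to one feature (assumed steering vector)."""
    return [row[feature] for row in W_dec]

def steer_by_vector(x, direction, coeff):
    """Option 2: leave the activation intact and add coeff * direction."""
    return [xi + coeff * di for xi, di in zip(x, direction)]

def steer_tokens(residuals, direction, coeff, positions):
    """Apply option 2 only at the chosen token positions."""
    return [steer_by_vector(x, direction, coeff) if t in positions else list(x)
            for t, x in enumerate(residuals)]

# Toy identity SAE with d_model = n_latents = 2, so results are easy to check.
W_enc = [[1.0, 0.0], [0.0, 1.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0]]
b_enc, b_dec = [0.0, 0.0], [0.0, 0.0]
x = [1.0, -2.0]

replaced = steer_by_reconstruction(x, W_enc, b_enc, W_dec, b_dec, feature=1, value=3.0)
direction = decoder_direction(W_dec, feature=1)
added = steer_by_vector(x, direction, coeff=2.0)
last_only = steer_tokens([[0.0, 0.0], [1.0, 1.0]], direction, coeff=2.0, positions={1})
```

Option 1 also passes the SAE's reconstruction error into the model, since the original activation is discarded. Option 2 keeps the original activation and only shifts it along one direction. The `positions` argument stands in for the token-selection question, which the notes leave open.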
Proposals
- ReFAT with SAE refusal feature - 4 to 7
- MoE of SAE
- Hierarchical SAE (reversed U-Net)
- Converting to a sequential SAE
Figure: traditional SAE (input/output per token) vs. sequential SAE
LLM Visualization
A 3D animated visualization of an LLM with a walkthrough.
https://bbycroft.net/llm
Seonglae Cho