UCL SNLP SAE Research

What is SAE

EleutherAI: https://arxiv.org/abs/2309.08600


Anthropic: https://transformer-circuits.pub/2023/monosemantic-features

Feature Browser: https://transformer-circuits.pub/2023/monosemantic-features/vis/a1.html

OpenAI: https://openai.com/index/extracting-concepts-from-gpt-4/

GPT2-SAE: https://huggingface.co/jbloom/GPT2-Small-SAEs-Reformatted/tree/main
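
For orientation, a minimal sketch of what the SAEs in the links above compute, assuming the common ReLU encoder / linear decoder form with an L1 sparsity penalty. The sizes, the random weights, and the names `sae_forward` and `sae_loss` are illustrative assumptions, not taken from any of the linked codebases.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical sizes; real SAEs are far wider

# Randomly initialised, untrained parameters (shape illustration only).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations x (batch, d_model) into features, then reconstruct x."""
    # Subtracting b_dec before encoding is one convention; some variants skip it.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec                          # linear decoder
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features to be sparse."""
    f, x_hat = sae_forward(x)
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f), axis=-1))
    return mse + l1_coeff * l1

x = rng.normal(size=(4, d_model))  # stand-in for residual-stream activations
features, x_hat = sae_forward(x)
loss = sae_loss(x)
```

The feature dimension is larger than the model dimension (an overcomplete dictionary), and the sparsity penalty is what is meant to make individual features interpretable.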

Research Areas for SAE

Improving SAE model architecture

Gated SAE (JumpReLU) https://arxiv.org/pdf/2404.16014
TopK SAE https://cdn.openai.com/papers/sparse-autoencoders.pdf
BatchTopK SAE https://arxiv.org/pdf/2412.06410
Hierarchical RQAE https://hkamath.me/blog/2024/rqae/
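
A rough sketch of how the TopK and BatchTopK activations differ, assuming they act on encoder pre-activations of shape (batch, d_sae). The function names are hypothetical, and whether a ReLU is applied before the selection varies by implementation, so it is left out here.

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row and zero the rest."""
    out = np.zeros_like(pre_acts)
    idx = np.argpartition(pre_acts, -k, axis=-1)[:, -k:]  # top-k indices per row
    rows = np.arange(pre_acts.shape[0])[:, None]
    out[rows, idx] = pre_acts[rows, idx]
    return out

def batch_topk_activation(pre_acts, k):
    """Keep the batch_size * k largest pre-activations across the whole batch."""
    n_keep = pre_acts.shape[0] * k
    flat = pre_acts.ravel()
    out = np.zeros_like(flat)
    idx = np.argpartition(flat, -n_keep)[-n_keep:]  # top indices over the batch
    out[idx] = flat[idx]
    return out.reshape(pre_acts.shape)

pre = np.array([[3.0, 1.0, 2.0, 0.5],
                [0.1, 0.2, 0.3, 0.4]])
per_sample = topk_activation(pre, k=2)       # exactly 2 active features per row
per_batch = batch_topk_activation(pre, k=2)  # 2 active per row only on average
```

In this toy batch the per-sample version keeps two features in each row, while the batch version spends all four slots on the first row because its values are larger.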

Training SAE methods

Transcoders (take the MLP input as the SAE input and reconstruct the MLP layer's output, instead of reconstructing the input vector itself) https://arxiv.org/pdf/2406.11944
Transfer Learning along Layers (not compatible) https://aclanthology.org/2024.blackboxnlp-1.32.pdf
Layer-compatible Crosscoder https://transformer-circuits.pub/2024/crosscoders/index.html
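
To make the transcoder objective concrete, a minimal sketch under assumptions: a toy frozen MLP stands in for the model layer, and the names `mlp`, `transcoder` and `transcoder_loss` plus all sizes are hypothetical. The only difference from a plain SAE loss is the target.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, d_latent = 8, 32, 64  # hypothetical sizes

# Stand-in for one frozen MLP layer of the language model.
W_in = rng.normal(scale=0.1, size=(d_model, d_mlp))
W_out = rng.normal(scale=0.1, size=(d_mlp, d_model))

def mlp(x):
    return np.maximum(0.0, x @ W_in) @ W_out

# Transcoder parameters (randomly initialised, untrained).
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)

def transcoder(mlp_in):
    """Sparse latents from the MLP input, decoded into a prediction of the MLP output."""
    f = np.maximum(0.0, mlp_in @ W_enc + b_enc)
    return f, f @ W_dec + b_dec

def transcoder_loss(mlp_in, l1_coeff=1e-3):
    f, pred = transcoder(mlp_in)
    target = mlp(mlp_in)  # the target is the MLP output, not the input
    mse = np.mean(np.sum((pred - target) ** 2, axis=-1))
    return mse + l1_coeff * np.mean(np.sum(np.abs(f), axis=-1))

mlp_in = rng.normal(size=(4, d_model))
latents, mlp_out_pred = transcoder(mlp_in)
loss = transcoder_loss(mlp_in)
```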


Extracting and interpreting SAE features automatically

LLM as a Neuron Explainer https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
VocabProj https://arxiv.org/pdf/2501.08319

The activation reconstructed by the SAE decoder for a chosen feature can be used as a steering vector

Bias steering: https://www.anthropic.com/research/evaluating-feature-steering
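
One common way to do this, sketched under assumptions: take the decoder row of a single feature as the steering direction and add a scaled copy to the activations. The function name `steer`, the unit-normalisation, and the toy `W_dec` are illustrative choices, not the method of the linked post.

```python
import numpy as np

def steer(resid, W_dec, feature_idx, coeff):
    """Add coeff * (unit-norm decoder direction of one feature) to every token."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)  # normalising is a choice
    return resid + coeff * direction

# Toy decoder with two features in a 4-dimensional activation space.
W_dec = np.array([[3.0, 0.0, 4.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
resid = np.zeros((3, 4))  # stand-in for (seq_len, d_model) activations
steered = steer(resid, W_dec, feature_idx=0, coeff=5.0)
```

With the direction normalised, the coefficient is directly the size of the shift applied to each token's activation.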

Limitations of SAE Research

SAE cannot capture high-level context

Training an SAE is expensive

Training a separate SAE for every layer of every model is inefficient

Research Ideas

Build a hierarchical SAE to capture high-level context, for example by expanding and shrinking the latent dimension step by step (256 → 512 → 1024 …)
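
One possible reading of this idea, sketched with a swapped-in technique: nested latent prefixes (in the spirit of Matryoshka-style SAEs), where the first 256, 512 and 1024 latents must each reconstruct the input on their own. This is an assumption about what the hierarchy could look like, not the proposal itself; all names and sizes other than the three widths are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16                 # hypothetical activation size
widths = (256, 512, 1024)    # nested latent sizes from the note

W_enc = rng.normal(scale=0.05, size=(d_model, widths[-1]))
W_dec = rng.normal(scale=0.05, size=(widths[-1], d_model))

def nested_reconstructions(x):
    """Reconstruct x from the first m latents only, for each nested width m."""
    f = np.maximum(0.0, x @ W_enc)
    return {m: f[:, :m] @ W_dec[:m] for m in widths}

def nested_loss(x):
    """Sum of reconstruction errors over all widths, so coarse prefixes must work alone."""
    recons = nested_reconstructions(x)
    return sum(np.mean((x - r) ** 2) for r in recons.values())

x = rng.normal(size=(4, d_model))
recons = nested_reconstructions(x)
loss = nested_loss(x)
```

Under this reading, the early latents are pushed toward coarse, high-level features and later latents refine them.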

Multi-token SAE: feed several tokens' attention residual vectors to the SAE at the same time (the input becomes a sequence), unlike conventional SAE inference where each token is processed independently
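
A minimal sketch of one way to build such an input, assuming a fixed-size causal window: each token's vector is concatenated with those of its predecessors before being passed to the SAE. The helper `window_inputs` and the zero-padding on the left are assumptions for illustration.

```python
import numpy as np

def window_inputs(resid, window):
    """Concatenate each token's vector with its window-1 predecessors (zero-padded)."""
    seq_len, d_model = resid.shape
    padded = np.vstack([np.zeros((window - 1, d_model)), resid])
    # Row t holds tokens t-window+1 .. t, oldest first, flattened into one vector.
    return np.stack([padded[t:t + window].ravel() for t in range(seq_len)])

resid = np.arange(12, dtype=float).reshape(4, 3)  # toy (seq_len=4, d_model=3)
X = window_inputs(resid, window=3)                # shape (4, 9)
```

The SAE encoder would then take inputs of size window * d_model instead of d_model, which is also where the extra training cost comes from.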

Additional Resources

Super Weight https://arxiv.org/abs/2411.07191

Refusal is a single direction https://arxiv.org/abs/2406.11717

Adversarial training via refusal feature (Karen co-auth) https://arxiv.org/abs/2409.20089

SAE pipeline

Choose LLM

GPT2 ←
Gemma 2 2B ← Gemma Scope

Architect SAE

Train SAE

Interpret features automatically

Select useful features manually

Extract steering vectors and coefficients

Run inference and evaluate

Design choice: how to apply steering

Use the SAE at inference: fix the feature value and replace the activation with the SAE reconstruction
Use only the steering vector: add the steering vector scaled by a coefficient
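
The first option can be sketched as follows, assuming a plain ReLU SAE with random, untrained weights; `clamp_and_reconstruct` and the sizes are hypothetical. The second option is a plain vector addition and needs no SAE forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical sizes
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    return f @ W_dec + b_dec

def clamp_and_reconstruct(x, feature_idx, value):
    """Fix one feature to a chosen value and return the SAE reconstruction."""
    f = encode(x)
    f[:, feature_idx] = value
    return decode(f)

x = rng.normal(size=(3, d_model))
plain = decode(encode(x))
steered = clamp_and_reconstruct(x, feature_idx=5, value=10.0)
```

Because the decoder is linear, `steered - plain` lies exactly along `W_dec[5]`, so the two options differ mainly in whether the SAE's reconstruction error is also injected into the model.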

Which tokens should the steering vector be applied to?
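
A small sketch of how this choice could be made explicit, assuming activations of shape (seq_len, d_model); the helper `steer_positions` and the three example selections are illustrative, not a recommendation.

```python
import numpy as np

def steer_positions(resid, vector, coeff, positions):
    """Add coeff * vector only at the selected token positions."""
    out = resid.copy()
    out[positions] += coeff * vector
    return out

resid = np.zeros((5, 4))   # toy (seq_len=5, d_model=4) activations
vector = np.ones(4)        # stand-in steering vector
prompt_len = 3             # hypothetical split between prompt and generation

all_tokens = steer_positions(resid, vector, 2.0, slice(None))
last_only = steer_positions(resid, vector, 2.0, [-1])
generated = steer_positions(resid, vector, 2.0, slice(prompt_len, None))
```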

Proposals

ReFAT with SAE refusal feature - 4 to 7

MoE of SAE

Hierarchical SAE (reversed U-Net)

Converting to a sequential SAE

Traditional SAE: input/output per token

sequential SAE

LLM Visualization

A 3D animated visualization of an LLM with a walkthrough.

https://bbycroft.net/llm
