UCL SNLP SAE Research

Creator
Seonglae Cho
Created
2025 Jan 17 18:7
Edited
2025 Jan 24 11:59

What is SAE


Research Areas for SAE

Limitations of SAE Research

  • SAEs cannot capture high-level context
  • Training an SAE is expensive
  • Training a separate SAE for each layer of each model is inefficient
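
To give the per-layer cost a rough scale, the following back-of-the-envelope sketch counts the parameters of one SAE per layer of GPT-2 small (12 layers, residual width 768). The expansion factor of 32 and the helper name `sae_param_count` are assumptions made for illustration only; they are not taken from these notes.

```python
def sae_param_count(d_model: int, d_sae: int) -> int:
    """Parameters of a one-hidden-layer SAE: encoder and decoder weights plus biases."""
    encoder = d_model * d_sae + d_sae    # W_enc and b_enc
    decoder = d_sae * d_model + d_model  # W_dec and b_dec
    return encoder + decoder


# GPT-2 small: 12 transformer layers, residual stream width 768.
N_LAYERS, D_MODEL = 12, 768
EXPANSION = 32  # assumed ratio d_sae / d_model, chosen only for illustration

per_layer = sae_param_count(D_MODEL, D_MODEL * EXPANSION)
total = per_layer * N_LAYERS
print(f"{per_layer:,} parameters per SAE, {total:,} for all {N_LAYERS} layers")
# prints "37,774,080 parameters per SAE, 453,288,960 for all 12 layers"
```

Under these assumptions the full set of SAEs holds several times more parameters than GPT-2 small itself (about 124M), which is one way to read the inefficiency noted above.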

Research Ideas

  • Build a hierarchical SAE to capture high-level context, for example by expanding and shrinking the latent dimension in granular steps (256 → 512 → 1024 …)
  • Multi-token SAE: feed multiple tokens' attention residual vectors to the SAE at the same time (the input becomes a sequence), unlike conventional SAE inference, where each token is processed independently
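
As one possible reading of the hierarchical idea, the sketch below builds a family of ReLU SAEs at the latent widths 256, 512 and 1024 and runs the same activations through each. The `TinySAE` class, the model width of 64, and the random (untrained) weights are all assumptions for illustration; how the levels would be linked or trained jointly is not specified in these notes and is left open here.

```python
import numpy as np


class TinySAE:
    """ReLU sparse autoencoder with randomly initialised (untrained) weights."""

    def __init__(self, d_in: int, d_latent: int, rng: np.random.Generator):
        self.W_enc = rng.normal(0.0, d_in ** -0.5, size=(d_in, d_latent))
        self.b_enc = np.zeros(d_latent)
        self.W_dec = rng.normal(0.0, d_latent ** -0.5, size=(d_latent, d_in))
        self.b_dec = np.zeros(d_in)

    def encode(self, x: np.ndarray) -> np.ndarray:
        # Non-negative feature activations, one row per input vector.
        return np.maximum((x - self.b_dec) @ self.W_enc + self.b_enc, 0.0)

    def decode(self, f: np.ndarray) -> np.ndarray:
        # Linear reconstruction of the input from the features.
        return f @ self.W_dec + self.b_dec


rng = np.random.default_rng(0)
D_MODEL = 64               # hypothetical activation width
WIDTHS = (256, 512, 1024)  # latent sizes from the note above

saes = {w: TinySAE(D_MODEL, w, rng) for w in WIDTHS}
acts = rng.normal(size=(10, D_MODEL))  # random stand-ins for 10 token activations

for width, sae in saes.items():
    feats = sae.encode(acts)
    recon = sae.decode(feats)
    print(width, feats.shape, recon.shape)
```

Each level maps the same activation to a dictionary of a different size, so coarse and fine features can be compared side by side; a real hierarchical design would additionally need a rule that relates features across levels.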

Additional Resources


SAE pipeline

  1. Choose LLM
    1. GPT-2 ←
    2. Gemma 2 2B ← Gemma Scope
  2. Design the SAE architecture
  3. Train SAE
  4. Interpret features automatically
  5. Select useful features manually
  6. Extract steering vectors and coefficients
  7. Run inference and evaluate
      • Design choice: how to apply steering
        • Use the SAE at inference: fix the feature value and replace the activation with the SAE reconstruction
        • Use only the steering vector: add the steering vector scaled by its coefficient
      • Which tokens should the steering vector be applied to?
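
The two ways of applying steering in the last pipeline step can be sketched as follows. This is a minimal NumPy illustration with random stand-ins for trained SAE weights; the function names, the dimensions, and the choice of taking the steering vector from the feature's decoder row are assumptions, not the method used in this project. The boolean `token_mask` stands in for the open question of which tokens to steer.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_SAE, N_TOKENS = 16, 64, 5  # hypothetical sizes

# Random stand-ins for trained SAE weights.
W_enc = rng.normal(size=(D_MODEL, D_SAE)) / np.sqrt(D_MODEL)
b_enc = np.zeros(D_SAE)
W_dec = rng.normal(size=(D_SAE, D_MODEL)) / np.sqrt(D_SAE)
b_dec = np.zeros(D_MODEL)


def steer_via_sae(acts, feature, value, token_mask):
    """Option 1: encode, fix one feature's value, decode, and replace the activation."""
    feats = np.maximum(acts @ W_enc + b_enc, 0.0)
    feats[:, feature] = value
    recon = feats @ W_dec + b_dec
    # Only masked tokens receive the reconstruction; the rest stay unchanged.
    return np.where(token_mask[:, None], recon, acts)


def steer_via_vector(acts, feature, coeff, token_mask):
    """Option 2: add coeff times the feature's decoder direction to the activation."""
    steering_vector = W_dec[feature]
    return acts + token_mask[:, None] * coeff * steering_vector


acts = rng.normal(size=(N_TOKENS, D_MODEL))
last_token_only = np.zeros(N_TOKENS, dtype=bool)
last_token_only[-1] = True  # one possible answer to "which tokens"

out_sae = steer_via_sae(acts, feature=3, value=8.0, token_mask=last_token_only)
out_vec = steer_via_vector(acts, feature=3, coeff=8.0, token_mask=last_token_only)
```

Option 1 also passes the SAE's reconstruction error into the steered positions, while option 2 leaves everything except the added direction untouched.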

Proposals

  • ReFAT with SAE refusal feature - 4 to 7
  • MoE of SAE
  • Hierarchical SAE (reversed U-Net)
  • Converting to a sequential SAE
 
Traditional SAE: input/output computed independently for each token
Sequential SAE
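
The difference between the two input layouts can be sketched in terms of array shapes. Concatenating a sliding window of consecutive token activations is only one assumed way to turn the input into a sequence; the function names and the window size are hypothetical.

```python
import numpy as np


def per_token_inputs(acts: np.ndarray) -> np.ndarray:
    """Conventional setup: each row (one token's activation) is an independent SAE input."""
    return acts  # shape (n_tokens, d_model)


def windowed_inputs(acts: np.ndarray, window: int) -> np.ndarray:
    """Sequential variant (assumed): concatenate `window` consecutive token activations."""
    n_tokens, _ = acts.shape
    starts = range(n_tokens - window + 1)
    return np.stack([acts[s:s + window].reshape(-1) for s in starts])


acts = np.arange(6 * 4, dtype=float).reshape(6, 4)  # 6 tokens, d_model = 4
print(per_token_inputs(acts).shape)    # (6, 4)
print(windowed_inputs(acts, 3).shape)  # (4, 12)
```

With a window of 3, the SAE input width grows from `d_model` to `3 * d_model`, so the encoder sees neighbouring tokens together at the price of a larger input layer.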
 
LLM Visualization
A 3D animated visualization of an LLM with a walkthrough.