SAE Model
Code base
Neuron SAE Implementations
Collection
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability - a dlouapre Collection
A compilation of sparse auto-encoders trained on large language models.
https://huggingface.co/collections/dlouapre/sparse-auto-encoders-saes-for-mechanistic-interpretability
Training hyperparameters
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#appendix-hyperparameters
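The hyperparameters above belong to the SAE architecture used in Towards Monosemanticity: a single hidden layer with ReLU activations, trained to reconstruct MLP activations under an L1 sparsity penalty. A minimal NumPy sketch of that forward pass and loss, with toy sizes and an illustrative `l1_coeff` (not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 64  # toy sizes; real runs use much wider dictionaries
l1_coeff = 1e-3             # sparsity penalty weight (illustrative, not from the paper)

# Randomly initialised parameters for the sketch
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary rows
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode with ReLU into sparse features, then decode back to activation space."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # sparse feature activations
    x_hat = f @ W_dec + b_dec                          # reconstruction
    return x_hat, f

def sae_loss(x):
    """Reconstruction MSE plus L1 penalty on feature activations."""
    x_hat, f = sae_forward(x)
    mse = np.mean((x - x_hat) ** 2)
    l1 = l1_coeff * np.abs(f).sum(axis=-1).mean()
    return mse + l1

batch = rng.normal(size=(8, d_model))  # stand-in for a batch of MLP activations
print(f"loss on random batch: {sae_loss(batch):.4f}")
```

Training then minimises this loss with Adam over stored activations; the L1 term is what pushes most features to zero on any given input.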
Demo
Google Colab
https://colab.research.google.com/drive/17dQFYUYnuKnP6OwQPH9v_GSYUW5aj-Rp?usp=sharing#scrollTo=mJ6bUncxGN2Y
SAE
Google Colab
https://colab.research.google.com/drive/1PlFzI_PWGTN9yCQLuBcSuPJUjgHL7GiD#scrollTo=SXLZn776f_2J

Seonglae Cho