SAE Implementation

Creator

Seonglae Cho

Created

2024 Oct 31 9:45

Editor

Seonglae Cho

Edited

2025 Dec 18 16:35

Refs

SAE Model

Code base


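import pandas as pd
from sae_lens.toolkit.pretrained_saes_directory import get_pretrained_saes_directory

# Build a table with one row per pretrained SAE release and one column per metadata field.
df = pd.DataFrame.from_records(
    {k: v.__dict__ for k, v in get_pretrained_saes_directory().items()}
).T
# Drop the verbose columns; errors="ignore" tolerates sae_lens versions that lack some of them.
df.drop(
    columns=[
        "expected_var_explained",
        "expected_l0",
        "config_overrides",
        "conversion_func",
    ],
    inplace=True,
    errors="ignore",
)
df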

Neuron SAE Implementations

Collection

Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability - a dlouapre Collection

A compilation of sparse auto-encoders trained on large language models.

https://huggingface.co/collections/dlouapre/sparse-auto-encoders-saes-for-mechanistic-interpretability

Training hyperparameters

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

https://transformer-circuits.pub/2023/monosemantic-features#appendix-hyperparameters
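For orientation, the kind of model those hyperparameters configure can be sketched as a one-hidden-layer autoencoder with a ReLU encoder, unit-norm decoder directions, and a reconstruction loss plus an L1 sparsity penalty on the feature activations. The NumPy code below is only a minimal sketch under assumptions: the function names (`init_sae`, `sae_forward`, `sae_loss`), the dimensions, the random initialisation, and the `l1_coeff` default are illustrative placeholders, not values or code from the paper or from any library.

```python
import numpy as np


def init_sae(d_model, d_hidden, seed=0):
    """Create SAE parameters (hypothetical helper; init scheme is an assumption)."""
    rng = np.random.default_rng(seed)
    w_dec = rng.normal(size=(d_hidden, d_model))
    # Normalise each decoder row to unit length so the L1 penalty
    # cannot be reduced just by shrinking features and growing weights.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "w_enc": rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model),
        "b_enc": np.zeros(d_hidden),
        "w_dec": w_dec,
        "b_dec": np.zeros(d_model),
    }


def sae_forward(params, x):
    """Return (features, reconstruction) for x of shape (batch, d_model)."""
    # Centre the input with the decoder bias, then encode with a ReLU.
    pre = (x - params["b_dec"]) @ params["w_enc"] + params["b_enc"]
    feats = np.maximum(pre, 0.0)
    # Decode: a sparse, non-negative combination of decoder directions.
    recon = feats @ params["w_dec"] + params["b_dec"]
    return feats, recon


def sae_loss(params, x, l1_coeff=1e-3):
    """Return (total, mse, l1); l1_coeff is an illustrative placeholder."""
    feats, recon = sae_forward(params, x)
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(feats), axis=1))
    return mse + l1_coeff * l1, mse, l1


# Usage on random stand-in activations (not real model activations).
params = init_sae(d_model=8, d_hidden=32)
x = np.random.default_rng(1).normal(size=(4, 8))
feats, recon = sae_forward(params, x)
total, mse, l1 = sae_loss(params, x)
```

In practice the parameters would be trained by gradient descent in an autodiff framework; the sketch only fixes the shapes and the loss terms that the hyperparameters act on.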

demo

https://colab.research.google.com/drive/17dQFYUYnuKnP6OwQPH9v_GSYUW5aj-Rp?usp=sharing#scrollTo=mJ6bUncxGN2Y

Google Colab

sae

https://colab.research.google.com/drive/1PlFzI_PWGTN9yCQLuBcSuPJUjgHL7GiD#scrollTo=SXLZn776f_2J
