SAE Implementation

Creator: Seonglae Cho
Created: 2024 Oct 31 9:45
Edited: 2025 Dec 18 16:35

SAE Model

Code base

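import pandas as pd
from sae_lens.toolkit.pretrained_saes_directory import get_pretrained_saes_directory

# Build a table with one row per pretrained SAE release and one column per metadata field
df = pd.DataFrame.from_records(
    {k: v.__dict__ for k, v in get_pretrained_saes_directory().items()}
).T
# Drop the verbose columns so the table is easier to scan
df.drop(
    columns=[
        "expected_var_explained",
        "expected_l0",
        "config_overrides",
        "conversion_func",
    ],
    inplace=True,
)
df  # displayed as the cell output in a notebook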
Neuron SAE Implementations

Collection

Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability - a dlouapre Collection
A compilation of sparse auto-encoders trained on large language models.

Training hyperparameters

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
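The dictionary-learning setup in that paper is commonly summarised as a one-hidden-layer autoencoder with a ReLU encoder, trained on reconstruction error plus an L1 penalty on the feature activations. The following is a minimal NumPy sketch of that forward pass and loss, not the paper's or any library's actual code: the function names, the toy dimensions, the random initialisation, and the `l1_coeff` value are all assumptions chosen for illustration.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    # Subtract the decoder bias before encoding, then apply a ReLU encoder.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Reconstruct as a linear combination of dictionary directions.
    x_hat = f @ W_dec + b_dec
    return f, x_hat


def sae_loss(x, x_hat, f, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty, averaged over the batch."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f), axis=-1))
    return mse + l1_coeff * l1


# Toy sizes (hypothetical): 8-dimensional activations, 32 dictionary features.
rng = np.random.default_rng(0)
d_model, d_sae, batch = 8, 32, 4

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
# Keep each dictionary direction at unit norm so the L1 penalty
# cannot be reduced just by scaling the decoder up.
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

x = rng.normal(size=(batch, d_model))  # stand-in for model activations
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, l1_coeff=1e-3)
```

The main training hyperparameters in such a setup are the dictionary size (`d_sae` here) and the L1 coefficient, which trade reconstruction quality against sparsity.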
demo
Google Colab
sae
Google Colab
