
Neuron SAE Implementation

Creator: Seonglae Cho
Created: 2024 Oct 31 9:45
Editor: Seonglae Cho
Edited: 2025 Jun 7 15:56
Refs
SAE Training
Vision SAE
Audio Model SAE

Code base

Neuron SAE Implementations
Gemma Scope
GPT2 SAE
Mistral SAE
Llama SAE
CoT SAE
Pythia SAE
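All of these implementations train the same core module: a one-hidden-layer autoencoder over model activations with an L1 sparsity penalty. Below is a minimal PyTorch sketch of that shared architecture; the 8× expansion factor, names, and coefficients are illustrative assumptions, not taken from any particular repository above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal neuron SAE: reconstructs activations through an
    overcomplete, sparsely activated feature dictionary."""

    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        d_dict = d_model * expansion                     # overcomplete dictionary
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # shared decoder bias
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x - self.b_dec))  # sparse feature activations
        x_hat = self.decoder(f) + self.b_dec      # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity
    recon = F.mse_loss(x_hat, x)
    sparsity = l1_coeff * f.abs().sum(dim=-1).mean()
    return recon + sparsity
```

In practice, activations are harvested from one hook point of the host model (residual stream or MLP output) and streamed through this module in large shuffled batches; the per-model repositories above differ mainly in where they hook and how they scale the dictionary.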

Training hyperparameters

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
"Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze."
https://transformer-circuits.pub/2023/monosemantic-features#appendix-hyperparameters
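As a rough guide, a training configuration in the spirit of that appendix might look like the sketch below. Every value here is an illustrative placeholder rather than the paper's exact setting; consult the linked appendix before training.

```python
from dataclasses import dataclass

@dataclass
class SAETrainConfig:
    # Illustrative assumptions only; see the appendix linked above for
    # the hyperparameters actually used in Towards Monosemanticity.
    d_model: int = 512                       # width of the hooked activations
    expansion: int = 8                       # dictionary size = d_model * expansion
    l1_coeff: float = 1e-3                   # sparsity penalty weight
    lr: float = 1e-4                         # Adam learning rate
    batch_size: int = 4096                   # activation vectors per optimizer step
    total_activations: int = 8_000_000_000   # activations sampled from the LM
    resample_every: int = 25_000             # steps between dead-neuron resampling
```

Dead-neuron resampling (reinitializing dictionary features that stop firing) is one of the paper's key training details, so any reimplementation should budget for it alongside the loss itself.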
Demo
Google Colab: https://colab.research.google.com/drive/17dQFYUYnuKnP6OwQPH9v_GSYUW5aj-Rp?usp=sharing#scrollTo=mJ6bUncxGN2Y
SAE
Google Colab: https://colab.research.google.com/drive/1PlFzI_PWGTN9yCQLuBcSuPJUjgHL7GiD#scrollTo=SXLZn776f_2J

Copyright Seonglae Cho