SAE weight initialization

Weight initialization

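The weights are drawn from a uniform distribution, with the encoder and decoder matrices initialized as transposes of each other. If the SAE is a transcoder, the encoder matrix is also drawn from a uniform distribution instead of being tied to the decoder transpose.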
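A minimal NumPy sketch of this scheme, for illustration only. The function name `init_sae_weights`, the `transcoder` flag, and the bound `1/sqrt(n_input)` are assumptions; the note only says the distribution is uniform and does not give the range or any scaling factor.

```python
import numpy as np

def init_sae_weights(n_input, n_features, transcoder=False, seed=0):
    """Return (W_enc, W_dec) with shapes (n_input, n_features) and (n_features, n_input)."""
    rng = np.random.default_rng(seed)
    # Assumed bound: the note does not specify the range of the uniform distribution.
    bound = 1.0 / np.sqrt(n_input)
    W_dec = rng.uniform(-bound, bound, size=(n_features, n_input))
    if transcoder:
        # Transcoder: the encoder gets its own uniform draw.
        W_enc = rng.uniform(-bound, bound, size=(n_input, n_features))
    else:
        # Standard SAE: the encoder starts as the transpose of the decoder.
        W_enc = W_dec.T.copy()
    return W_enc, W_dec
```

Copying the transpose ties the two matrices only at initialization; they are free to diverge during training.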

Circuits Updates - January 2025

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

https://transformer-circuits.pub/2025/january-update/index.html

Bias initialization

The bias is initialized to the geometric median of the dataset activations.
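The geometric median has no closed form, so it is usually approximated iteratively. The sketch below uses Weiszfeld's algorithm, which is one common choice and not necessarily what the linked papers use. The function name, iteration count, and tolerances are assumptions.

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-8, tol=1e-6):
    """Approximate the point minimizing the sum of Euclidean distances to the rows of X."""
    y = X.mean(axis=0)  # start from the arithmetic mean
    for _ in range(n_iter):
        dist = np.linalg.norm(X - y, axis=1)
        # Inverse-distance weights; clamp to avoid division by zero.
        w = 1.0 / np.maximum(dist, eps)
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

In practice `X` would be a sample of activations with shape `(n_samples, n_input)`, and the result would be copied into the bias vector.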

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

https://transformer-circuits.pub/2023/monosemantic-features#appendix-autoencoder-bias

cdn.openai.com

https://cdn.openai.com/papers/sparse-autoencoders.pdf

Pass a portion of the data through the model to measure the distribution of each feature's pre-activation values. Then set each feature's bias, which acts as a threshold, so that the feature activates on approximately a fraction $\frac{l}{m}$ of the dataset, where $m$ is the dictionary size and $l$ is the activated count of pre-activations.
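The procedure above can be sketched with a per-feature quantile. This assumes a ReLU-style encoder in which a feature counts as active when its pre-activation plus bias is positive. The names `init_encoder_bias`, `pre_acts`, and `target_frac` are hypothetical, with `target_frac` standing in for the ratio l/m.

```python
import numpy as np

def init_encoder_bias(pre_acts, target_frac):
    """pre_acts: (n_samples, n_features) encoder outputs computed with zero bias."""
    # Per-feature threshold that a fraction `target_frac` of samples exceeds.
    thresholds = np.quantile(pre_acts, 1.0 - target_frac, axis=0)
    # Shifting by -threshold makes exactly those samples positive.
    return -thresholds
```

For example, with `pre_acts = X_sample @ W_enc` on a held-out batch, `init_encoder_bias(pre_acts, l / m)` returns one bias per feature.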

Circuits Updates - January 2025

https://transformer-circuits.pub/2025/january-update/index.html
