Weight initialization
Draw the weights from a uniform distribution and initialize the encoder and decoder matrices as transposes of each other. If the SAE is a transcoder, the encoder matrix is also initialized from a uniform distribution.
Circuits Updates - January 2025
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
https://transformer-circuits.pub/2025/january-update/index.html
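A minimal NumPy sketch of the weight initialization described above. The function name, the shapes, and the bound 1/sqrt(d_model) are assumptions made for illustration; the note does not state the range of the uniform distribution. Sampling the transcoder encoder independently of the decoder is one reading of the note, not a confirmed detail.

```python
import numpy as np

def init_sae_weights(d_model, d_sae, bound=None, transcoder=False, seed=0):
    """Return (W_enc, W_dec) with shapes (d_model, d_sae) and (d_sae, d_model)."""
    rng = np.random.default_rng(seed)
    if bound is None:
        # Assumed default; the note does not specify the uniform range.
        bound = 1.0 / np.sqrt(d_model)

    # Decoder: i.i.d. uniform samples in [-bound, bound).
    W_dec = rng.uniform(-bound, bound, size=(d_sae, d_model))

    if transcoder:
        # Transcoder: encoder drawn from its own uniform samples
        # (assumption: sampled independently of the decoder).
        W_enc = rng.uniform(-bound, bound, size=(d_model, d_sae))
    else:
        # Standard SAE: encoder starts as the transpose of the decoder.
        W_enc = W_dec.T.copy()
    return W_enc, W_dec
```

Copying the transpose (rather than sharing memory) lets the two matrices start tied and then diverge during training.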
Bias initialization
Initialize the bias to the geometric median of the data.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#appendix-autoencoder-bias
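The geometric median has no closed form, so it is usually approximated iteratively. The sketch below uses Weiszfeld's algorithm, which is a standard choice but not necessarily what the linked work used; the function name, iteration count, and tolerance are assumptions.

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-8):
    """Approximate the geometric median of the rows of X (n_samples, d)."""
    y = X.mean(axis=0)  # start from the arithmetic mean
    for _ in range(n_iter):
        # Distance from the current estimate to every sample,
        # clamped to avoid division by zero.
        dist = np.maximum(np.linalg.norm(X - y, axis=1), eps)
        w = 1.0 / dist
        # Weiszfeld update: inverse-distance weighted average of the samples.
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            return y_new
        y = y_new
    return y

# Hypothetical usage: `activations` is a sample of model activations.
# b_init = geometric_median(activations)
```

Unlike the mean, the geometric median is robust to a few extreme activation vectors, which is a plausible motivation for using it as a bias.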
Pass a portion of the data through the model to measure the distribution of each feature's pre-activation values. Then set each feature's bias from a threshold on that distribution, so that the feature activates at approximately a target rate across the entire dataset. The target is expressed in terms of m, the dictionary size, and l, the activated count of pre-activations.
Circuits Updates - January 2025
https://transformer-circuits.pub/2025/january-update/index.html
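The data-driven bias initialization above can be sketched as a per-feature quantile lookup. This assumes a ReLU-style encoder, where a feature is active when its pre-activation plus bias is positive. The function name and the `target_rate` parameter are hypothetical; the exact target in terms of m and l is not reproduced here and must be supplied by the caller.

```python
import numpy as np

def init_encoder_bias(X, W_enc, target_rate):
    """Pick a bias per feature so it fires on ~target_rate of the samples.

    X:           (n_samples, d_model) sample of model activations
    W_enc:       (d_model, d_sae) encoder weights
    target_rate: desired activation frequency in (0, 1), an assumed input
    """
    # Bias-free pre-activations of every feature on the sample.
    pre = X @ W_enc
    # Per-feature threshold exceeded by roughly target_rate of the samples.
    thresh = np.quantile(pre, 1.0 - target_rate, axis=0)
    # With a ReLU, pre + b > 0 exactly when pre > thresh if b = -thresh.
    return -thresh
```

Because the threshold is measured on a subset, the activation rate on the full dataset will only match the target approximately.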

Seonglae Cho