SAE weight initialization

Creator
Seonglae Cho
Created
2025 Feb 12 22:41
Edited
2025 Apr 6 19:5

Weight initialization

The decoder weights are drawn from a uniform distribution, and the encoder is initialized as the transpose of the decoder. If the SAE is a transcoder, the encoder matrix is also drawn from a uniform distribution instead of being copied from the decoder.
Circuits Updates - January 2025
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
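
A minimal sketch of the weight scheme described above, using only the Python standard library. The function name `init_sae_weights`, the `is_transcoder` flag, and the bound `1/sqrt(n_input)` are assumptions made for illustration; the note does not give the range of the uniform distribution or any rescaling of the transpose.

```python
import math
import random


def init_sae_weights(n_input, n_features, is_transcoder=False, seed=0):
    """Return (W_enc, W_dec) as nested lists.

    W_dec has shape (n_features, n_input).
    W_enc has shape (n_input, n_features).
    """
    rng = random.Random(seed)
    # Assumption: illustrative bound only, not taken from the note.
    bound = 1.0 / math.sqrt(n_input)

    def uniform_matrix(rows, cols):
        return [[rng.uniform(-bound, bound) for _ in range(cols)]
                for _ in range(rows)]

    W_dec = uniform_matrix(n_features, n_input)
    if is_transcoder:
        # Transcoder: the encoder gets its own uniform draw.
        W_enc = uniform_matrix(n_input, n_features)
    else:
        # Standard SAE: the encoder starts as the decoder transpose.
        W_enc = [list(col) for col in zip(*W_dec)]
    return W_enc, W_dec
```

For example, `init_sae_weights(512, 4096)` would return an encoder that is exactly the transpose of the decoder, while passing `is_transcoder=True` yields two independent draws.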

Bias initialization

Initialize the bias that is subtracted from the input (the decoder bias) to the geometric median of the dataset activations.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
cdn.openai.com
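
The geometric median has no closed form, so it is usually approximated iteratively. The sketch below uses Weiszfeld's algorithm, which is one common choice and not necessarily what the cited works use; the function name and the `n_iter` and `eps` parameters are assumptions for illustration.

```python
import math


def geometric_median(points, n_iter=100, eps=1e-8):
    """Approximate the geometric median of a list of equal-length vectors."""
    dim = len(points[0])
    # Start from the arithmetic mean.
    estimate = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(n_iter):
        # Weight each point by the inverse of its distance to the estimate.
        weights = []
        for p in points:
            dist = math.sqrt(sum((p[d] - estimate[d]) ** 2 for d in range(dim)))
            weights.append(1.0 / max(dist, eps))  # clamp to avoid division by zero
        total = sum(weights)
        new_estimate = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ]
        shift = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(new_estimate, estimate))
        )
        estimate = new_estimate
        if shift < eps:
            break
    return estimate
```

Unlike the mean, the result is barely moved by a few extreme activations, which is a plausible reason to prefer it as a starting point for the bias.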
Pass a portion of the data through the model to measure the distribution of each feature's pre-activation values. Then set each feature's bias so that, after thresholding, the feature activates at approximately a target rate across the entire dataset. The target is expressed in terms of m, the dictionary size, and l, the activated count of pre-activation.
Circuits Updates - January 2025
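
One way to realize the data-dependent bias described above is to pick, per feature, the threshold that a chosen fraction of sampled pre-activations exceeds, and use its negative as the bias. This is a sketch under assumptions: `init_encoder_bias` and `target_rate` are hypothetical names, the pre-activations are taken to be computed without any bias, and the target rate is left as a free parameter because the note does not state its value.

```python
def init_encoder_bias(pre_acts, target_rate):
    """Choose one bias per feature from sampled pre-activations.

    pre_acts: list of samples, each a list of per-feature pre-activations
              computed without bias (needs at least two samples).
    target_rate: assumed fraction of samples on which a feature should fire.
    """
    n_samples = len(pre_acts)
    n_features = len(pre_acts[0])
    # Number of samples on which each feature should be active.
    n_active = max(1, min(n_samples - 1, round(target_rate * n_samples)))
    biases = []
    for j in range(n_features):
        column = sorted((row[j] for row in pre_acts), reverse=True)
        # Threshold halfway between the n_active-th and next largest value.
        threshold = 0.5 * (column[n_active - 1] + column[n_active])
        # With bias = -threshold, pre_act + bias > 0 for the top n_active samples.
        biases.append(-threshold)
    return biases
```

With distinct values, each feature then fires on exactly `n_active` of the sampled inputs, which approximates the target rate on the full dataset if the sample is representative.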
 
 
