The pre-bias gradient uses a trick: the pre-activation gradients are summed across the batch dimension before being multiplied with the encoder weights, so the batched product collapses into a single one. Separately, the pre-encoder bias is constrained to equal the negative of the post-decoder bias and is initialized to the geometric median of the dataset.
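A minimal NumPy sketch of that gradient trick, assuming a single linear encoder with pre-activation z = W_enc(x − b_pre) + b_enc; the shapes and the upstream gradient here are illustrative, not taken from the source:

```python
import numpy as np

# Hypothetical shapes: batch B, input dim D, latent dim F
B, D, F = 64, 512, 2048
rng = np.random.default_rng(0)
W_enc = rng.standard_normal((F, D)) * 0.01
dL_dz = rng.standard_normal((B, F))   # upstream gradient at the pre-activation z

# Naive: form the per-example (B, D) contributions, then sum over the batch.
# dz/db_pre = -W_enc, so dL/db_pre = -sum_b (dL/dz_b) @ W_enc
naive = -(dL_dz @ W_enc)              # (B, D) per-example contributions
naive = naive.sum(axis=0)

# Trick: sum the pre-activation gradients over the batch first,
# then do a single (F,) @ (F, D) product.
fast = -(dL_dz.sum(axis=0) @ W_enc)

assert np.allclose(naive, fast)
```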
Each training step of a sparse autoencoder generally consists of six major computations (a sketch of a full step follows the list):
- the encoder forward pass
- the encoder gradient
- the decoder forward pass
- the decoder gradient
- the latent gradient
- the pre-bias gradient
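A minimal NumPy sketch of one full step in execution order, assuming an Anthropic-style tied pre-bias (the pre-encoder bias is −b_dec), ReLU latents, an L1 sparsity penalty, and a sum-reduced reconstruction loss; all sizes and coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, F = 8, 16, 64                     # batch, model dim, latent dim
lam = 1e-3                              # L1 sparsity coefficient

# Parameters (b_dec also acts, negated, as the pre-encoder bias)
W_enc = rng.standard_normal((F, D)) * 0.02
b_enc = np.zeros(F)
W_dec = rng.standard_normal((D, F)) * 0.02
b_dec = np.zeros(D)
x = rng.standard_normal((B, D))

# encoder forward pass
x_bar = x - b_dec                       # subtract the decoder bias from the input
z = x_bar @ W_enc.T + b_enc             # pre-activations
f = np.maximum(z, 0.0)                  # latents (ReLU)

# decoder forward pass
x_hat = f @ W_dec.T + b_dec
loss = np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f))

# backward pass starts from the reconstruction error
dx_hat = 2.0 * (x_hat - x)              # dL/dx_hat

# decoder gradient
dW_dec = dx_hat.T @ f
db_dec_out = dx_hat.sum(axis=0)         # output-path contribution to b_dec

# latent gradient
df = dx_hat @ W_dec + lam * np.sign(f)
dz = df * (z > 0)                       # gate through the ReLU

# encoder gradient
dW_enc = dz.T @ x_bar
db_enc = dz.sum(axis=0)

# pre-bias gradient: sum dz over the batch, then one product with W_enc
db_dec = db_dec_out - dz.sum(axis=0) @ W_enc
```

The final line applies the pre-bias trick from above: the pre-activation gradients dz are summed over the batch once, then multiplied with W_enc, rather than forming per-example products.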
Geometric median initialization
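A sketch of how this initialization could be computed, using Weiszfeld's algorithm over a sample of dataset activations; the function, sample, and tolerances are illustrative assumptions:

```python
import numpy as np

def geometric_median(xs: np.ndarray, n_iter: int = 100, eps: float = 1e-8) -> np.ndarray:
    """Weiszfeld's algorithm: iteratively re-weight points by inverse distance."""
    mu = xs.mean(axis=0)                      # start from the arithmetic mean
    for _ in range(n_iter):
        d = np.linalg.norm(xs - mu, axis=1)
        w = 1.0 / np.maximum(d, eps)          # avoid division by zero at a data point
        new_mu = (w[:, None] * xs).sum(axis=0) / w.sum()
        if np.linalg.norm(new_mu - mu) < eps:
            break
        mu = new_mu
    return mu

# Hypothetical usage: initialize the decoder bias (and hence the tied pre-bias)
sample = np.random.default_rng(0).standard_normal((4096, 512))   # sampled activations
b_dec = geometric_median(sample)
```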
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-autoencoder-bias
Insight behind the pre-bias
More findings on Memorization and double descent — AI Alignment Forum
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. …
https://www.alignmentforum.org/posts/KzwB4ovzrZ8DYWgpw/more-findings-on-memorization-and-double-descent


Seonglae Cho