Pre-bias gradient

Creator
Seonglae Cho
Created
2025 Feb 3 12:58
Edited
2025 Feb 3 13:09
Refs
The pre-bias gradient uses a trick: the pre-activation gradient is summed across the batch dimension before being multiplied by the encoder weights. In this setup, the pre-encoder bias is constrained to equal the negative of the post-decoder bias and is initialized to the geometric median of the dataset.
Each training step of a sparse autoencoder generally consists of six major computations:
  1. the encoder forward pass
  2. the encoder gradient
  3. the decoder forward pass
  4. the decoder gradient
  5. the latent gradient
  6. the pre-bias gradient
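A minimal NumPy sketch of the batch-sum trick for the encoder-path pre-bias gradient (all shapes and the tied-bias parameterization are illustrative assumptions, not the exact implementation of any particular codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_model, d_sae = 8, 16, 64

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
b_pre = np.zeros(d_model)   # pre-encoder bias, tied to -b_dec

x = rng.normal(size=(B, d_model))

# Forward pass: encoder, then decoder (b_dec = -b_pre)
x_cent = x + b_pre
pre = x_cent @ W_enc + b_enc       # pre-activations, shape (B, d_sae)
z = np.maximum(pre, 0.0)           # ReLU latents
x_hat = z @ W_dec - b_pre

# Backward pass through the MSE reconstruction loss
g_xhat = 2.0 * (x_hat - x) / B     # gradient w.r.t. x_hat
g_z = g_xhat @ W_dec.T             # latent gradient
g_pre = g_z * (pre > 0)            # pre-activation gradient, shape (B, d_sae)

# Naive encoder-path pre-bias gradient: materializes a (B, d_model)
# intermediate, then reduces over the batch.
g_bpre_naive = (g_pre @ W_enc.T).sum(axis=0)

# Trick: sum the pre-activation gradient across the batch FIRST,
# then do a single vector-matrix product with the encoder weights.
g_bpre_fast = g_pre.sum(axis=0) @ W_enc.T

assert np.allclose(g_bpre_naive, g_bpre_fast)
```

The two orderings agree by linearity; the summed-first version avoids the (B, d_model) intermediate. Note the full gradient of b_pre would also pick up a term from the decoder output (since b_dec = -b_pre); only the encoder path is shown here.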

Geometric median initialization

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
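One standard way to compute the geometric median used for this initialization is Weiszfeld's iteratively re-weighted mean; the sketch below is a generic implementation of that algorithm, not necessarily the exact routine used in the cited work:

```python
import numpy as np

def geometric_median(X, iters=100, eps=1e-8):
    """Weiszfeld's algorithm: repeatedly take a distance-weighted mean,
    which converges to the point minimizing the sum of Euclidean
    distances to the rows of X."""
    y = X.mean(axis=0)  # start from the ordinary mean
    for _ in range(iters):
        d = np.linalg.norm(X - y, axis=1)
        w = 1.0 / np.maximum(d, eps)          # inverse-distance weights
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

# Example: initialize the decoder bias from a batch of activations
X = np.random.default_rng(0).normal(size=(100, 8))
b_dec_init = geometric_median(X)
```

Unlike the mean, the geometric median is robust to outlier activations, which is why it makes a reasonable initialization target for the decoder bias.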
insight of Pre-bias
More findings on Memorization and double descent — AI Alignment Forum
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. …
 
 