Weight Interference

Now for the logit weights.

Anthropic hypothesizes that the central mode of the logit weight distribution corresponds to "weight interference", and that the shared outlier mode is the important observation. That is, the model might ideally prefer all of the central-mode weights to be zero, but superposition with other features and their weights makes this impossible.

https://transformer-circuits.pub/2023/monosemantic-features#feature-arabic-effect
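
The hypothesis above can be illustrated with a toy model. The sketch below is not the paper's method: it assumes random unit feature directions, a single linear unembedding, and made-up sizes (d_model, n_features, vocab, k are all hypothetical). Each feature "wants" a weight of 1 on a handful of tokens and 0 everywhere else, but because there are more features than dimensions, reading one feature's logit weights back out yields a spread of small nonzero values (a central mode) plus a separate cluster near 1 (an outlier mode).

```python
import numpy as np

# Toy setup: every size and name here is an illustrative assumption.
rng = np.random.default_rng(0)
d_model, n_features, vocab, k = 256, 1024, 2000, 20

# More features than dimensions: random unit directions stand in for superposition.
directions = rng.standard_normal((n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# "Ideal" logit weights: each feature boosts k tokens by 1 and leaves the rest at 0.
ideal = np.zeros((n_features, vocab))
for f in range(n_features):
    ideal[f, rng.choice(vocab, size=k, replace=False)] = 1.0
ideal[0] = 0.0
ideal[0, :k] = 1.0  # feature 0 targets tokens 0..k-1 (its "outlier" tokens)

# A single shared linear map has to carry every feature's ideal weights at once.
unembed = directions.T @ ideal  # shape (d_model, vocab)

# Effective logit weights of feature 0: its ideal weights plus interference
# from every other feature whose direction is not exactly orthogonal to it.
effective = directions[0] @ unembed  # shape (vocab,)

is_target = ideal[0] > 0
central = effective[~is_target]  # ideally all zero
outlier = effective[is_target]   # ideally all one

print(f"central mode: mean={central.mean():.3f}, std={central.std():.3f}")
print(f"outlier mode: mean={outlier.mean():.3f}, std={outlier.std():.3f}")
```

In this toy, the width of the central mode comes entirely from the nonzero dot products between feature 0 and the other feature directions. Making the directions exactly orthogonal (which requires n_features <= d_model) would collapse the central mode to zero.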

first and second mode term

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

Recommendations