Randomized Transformer

Random Weight Initialization

Strong Inductive Bias of inherent model

Models in which only the embeddings are trained work for simple pattern matching.

SAEs find only single-token features

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

https://transformer-circuits.pub/2023/monosemantic-features

Are SAE features really meaningful, or are they a property of the sparse text dataset (given that only single-token features are discovered)? I suspect the results owe more to the structure of the transformer itself than to superposition in the token embeddings, since the comparison between random and intact token embeddings showed similar outcomes. This makes me curious about how these findings would generalize to other architectures.

https://arxiv.org/pdf/2501.17727v1

Local KL Volume

This methodology defines a set of KL-neighbors (a behaviorally similar region) around the trained model weights and efficiently estimates the probability that this region occupies under the initialization distribution (the Local KL Volume) using the Monte Carlo Method with Importance sampling. Local KL Volume measures the "size" of the parameter region where the output distribution remains nearly unchanged (KL divergence ≤ ε), from the perspective of the initialization distribution.
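
To make the definition concrete, here is a minimal sketch of the naive version of this estimate: sample weights from the initialization distribution and count how often the mean KL divergence to the reference model stays within ε. Everything here is an assumption for illustration (a tiny linear-softmax model, a Gaussian initialization with scale `sigma`, hand-picked `w_ref` standing in for trained weights). It uses plain Monte Carlo only and does not implement the importance sampling the method relies on, which is what makes the estimate feasible for real networks where the volume is astronomically small.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def logits(w, x):
    """Linear model: w holds one weight row per class, x is a feature vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mean_kl(w_ref, w, data):
    """Mean KL(p_ref || p_w) over the inputs in data."""
    total = 0.0
    for x in data:
        lp = log_softmax(logits(w_ref, x))
        lq = log_softmax(logits(w, x))
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(data)

def naive_local_kl_volume(w_ref, data, eps, sigma=1.0, n_samples=2000, seed=0):
    """Fraction of init samples whose mean KL to w_ref is at most eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Draw a full weight matrix from the (assumed) Gaussian init distribution.
        w = [[rng.gauss(0.0, sigma) for _ in row] for row in w_ref]
        if mean_kl(w_ref, w, data) <= eps:
            hits += 1
    return hits / n_samples

# Hypothetical toy setup: 8 two-dimensional inputs, 3 classes.
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(8)]
w_ref = [[0.5, -0.3], [-0.2, 0.4], [0.1, 0.1]]  # stand-in for trained weights

vol_tight = naive_local_kl_volume(w_ref, data, eps=0.05)
vol_loose = naive_local_kl_volume(w_ref, data, eps=0.5)
print(vol_tight, vol_loose)
```

Because the estimate depends on `data`, running it on a different dataset gives a different volume, which is the data dependence the notes rely on.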

KL volume is data-dependent, and the ratio of local KL volume between held-out (test or validation) and training datasets can be used to assess Overfitting. If the ratio of held-out to train is less than 1, it indicates overfitting; if it is close to 1, the fit is optimal; and if it is greater than 1, it suggests Underfitting. Using second-moment information from optimizers like the Adam Optimizer reduces directional variance, significantly decreasing the variance of the volume estimate. The negative log of the local volume can be interpreted as the network's information content (from an MDL perspective) and linked to generalization performance. As training progresses towards overfitting, local volume decreases (complexity increases).
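
The ratio test and the MDL reading can be sketched in a few lines. This is an illustrative helper, not something taken from the paper: the function names and the tolerance `tol` that decides what counts as "close to 1" are assumptions. It works with log-volumes, since the volumes themselves are usually far too small to represent directly.

```python
import math

def information_bits(log_volume):
    """Negative log of the local volume, expressed in bits (MDL-style reading)."""
    return -log_volume / math.log(2)

def fit_diagnosis(log_vol_heldout, log_vol_train, tol=0.1):
    """Classify the held-out/train volume ratio; tol is a hypothetical threshold."""
    ratio = math.exp(log_vol_heldout - log_vol_train)
    if ratio < 1.0 - tol:
        return ratio, "overfitting"
    if ratio > 1.0 + tol:
        return ratio, "underfitting"
    return ratio, "near-optimal"

# Made-up log-volumes: held-out volume is 100x smaller than the train volume.
ratio, label = fit_diagnosis(math.log(1e-12), math.log(1e-10))
print(label, information_bits(math.log(1e-10)))
```

A smaller volume gives a larger bit count, which matches the note that complexity increases as the local volume shrinks.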

https://arxiv.org/pdf/2501.18812
