Model Interpretability

Creator: Seonglae Cho
Created: 2025 Feb 3 20:57
Edited: 2025 Aug 8 0:23
Refs: MoE
Dense training updates all parameters on all data, treating the data as homogeneous. This approach has several critical issues and motivates new training methods: it harms generalization by emphasizing domains in proportion to their prevalence in the training mix; it reduces efficiency by requiring synchronous computation; it reduces flexibility by leaving LMs susceptible to catastrophic forgetting; and it increases risk, because unwanted domains cannot be removed at test time once parameters are frozen after training.
Modular language models specialize different parts of the model (experts) for different domains of data. There are three approaches: modular, asynchronous, and sparse. Examples include MoNet, Gradient Routing, and Branch Train Merge. MoE (Mixture of Experts) is the dominant modular architecture. An open question remains: does everything really come from pretraining?
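To make the sparse-expert idea concrete, below is a minimal top-k MoE layer sketch in PyTorch. The layer sizes, number of experts, and routing scheme are illustrative assumptions, not the implementation of any specific work mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a sparse Mixture-of-Experts layer: each token is routed to k experts."""

    def __init__(self, d_model: int = 256, d_ff: int = 512, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network producing expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); only k experts run per token, so compute stays sparse.
        scores = self.router(x)                             # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.k, dim=-1)   # route each token to its k best experts
        weights = F.softmax(top_scores, dim=-1)             # normalize gates over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage sketch: 10 tokens of width 256 pass through the layer.
# moe = TopKMoE(); y = moe(torch.randn(10, 256))
```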
Model Interpretability Methods

LoRA
LoRA-Models-for-SAEs (matchten, updated 2025 Jul 28 13:50): LoRA is applied to the LLMs, not to the SAEs.

For fine-tuning, approximately 15M tokens were randomly sampled from The Pile, with a separate 1M-token validation set used for evaluation. LoRA training significantly reduced the distance between the adapted model's logits and the original model's logits; the training loss for LoRA was the KL divergence between the two sets of logits.
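As a rough sketch of that objective, the KL-of-logits loss could be computed as below. This assumes the reference logits come from the frozen original model and the student logits from the LoRA-adapted model; the function name and shapes are illustrative, not the repo's actual code.

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor, reference_logits: torch.Tensor) -> torch.Tensor:
    """KL(reference || student) computed from raw next-token logits.

    Both tensors are (batch, seq_len, vocab). Only the LoRA parameters behind
    student_logits receive gradients; the reference logits are detached.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_reference = F.softmax(reference_logits.detach(), dim=-1)
    # Sum KL over the vocabulary, then average over batch and sequence positions.
    return F.kl_div(log_p_student, p_reference, reduction="none").sum(dim=-1).mean()


# Usage sketch (hypothetical): reference_logits from the original model in eval mode,
# student_logits from the same model with the LoRA adapter active.
# loss = kl_logits_loss(student_logits, reference_logits); loss.backward()
```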

Recommendations