Model Interpretability

Creator: Seonglae Cho
Created: 2025 Feb 3 20:57
Edited: 2025 Aug 8 0:23
Refs: MoE
Dense training updates all parameters on all data, treating the data as homogeneous. This approach has several critical issues and motivates new training methods: it harms generalization by emphasizing domains in proportion to their prevalence in the training mix; it reduces efficiency by requiring synchronous computation; it reduces flexibility by leaving LMs susceptible to catastrophic forgetting; and it increases risk, because unwanted domains cannot be removed at test time once parameters are frozen after training.
Modular language models specialize different parts of the model (experts) for different domains of data. There are three approaches: modular, asynchronous, and sparse. Examples include MoNet, Gradient Routing, and Branch Train Merge. MoE (Mixture of Experts) is the dominant modular architecture. An open question remains: does everything really come from pretraining?
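To make the sparse-expert idea concrete, below is a minimal top-k MoE layer sketch in PyTorch. The layer sizes, number of experts, and routing scheme are illustrative assumptions, not the implementation of any specific work mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a sparse Mixture-of-Experts layer: each token is routed to k experts."""

    def __init__(self, d_model: int = 256, d_ff: int = 512, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network producing expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); only k experts run per token, so compute stays sparse.
        scores = self.router(x)                             # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.k, dim=-1)   # route each token to its k best experts
        weights = F.softmax(top_scores, dim=-1)             # normalize gates over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage sketch: 10 tokens of width 256 pass through the layer.
# moe = TopKMoE(); y = moe(torch.randn(10, 256))
```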
Model Interpretability Methods

LoRA
LoRA-Models-for-SAEs (matchten, updated 2025 Jul 28 13:50): LoRA is applied to the LLMs, not to the SAEs.

For fine-tuning, approximately 15M tokens were randomly sampled from The Pile, with a separate 1M-token validation set used for evaluation. LoRA training significantly reduced the distance between the adapted model's logits and the original model's logits; the training loss for LoRA was the KL divergence between the two sets of logits.
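As a rough sketch of that objective, the KL-of-logits loss could be computed as below. This assumes the reference logits come from the frozen original model and the student logits from the LoRA-adapted model; the function name and shapes are illustrative, not the repo's actual code.

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor, reference_logits: torch.Tensor) -> torch.Tensor:
    """KL(reference || student) computed from raw next-token logits.

    Both tensors are (batch, seq_len, vocab). Only the LoRA parameters behind
    student_logits receive gradients; the reference logits are detached.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_reference = F.softmax(reference_logits.detach(), dim=-1)
    # Sum KL over the vocabulary, then average over batch and sequence positions.
    return F.kl_div(log_p_student, p_reference, reduction="none").sum(dim=-1).mean()


# Usage sketch (hypothetical): reference_logits from the original model in eval mode,
# student_logits from the same model with the LoRA adapter active.
# loss = kl_logits_loss(student_logits, reference_logits); loss.backward()
```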

Recommendations