Branch-Train-MiX

Creator: Seonglae Cho
Created: 2025 Oct 19 23:30
Edited: 2025 Nov 3 22:27

BTX

A method for efficiently adding domain-specific expertise (e.g., math, code) to large language models while maintaining general capability. Unlike BTM, which loses capability through parameter averaging, BTX keeps the experts physically separate inside one model using a Mixture-of-Experts structure.
Previously, training separate specialized models left no way to integrate them into a single model, while continually training a single model caused catastrophic forgetting of existing knowledge.
  1. Branch-Train stage: clone the base model (e.g., Llama-2 7B) multiple times and train each copy asynchronously, in parallel, on its own domain
  2. MiX stage: merge the feedforward layers of the branch-trained models into a Mixture-of-Experts (MoE) structure and train a router to select the appropriate experts for each input token (see the sketch after this list)
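
The MiX stage can be illustrated with a short PyTorch sketch. This is a minimal illustration under assumptions, not the official BTX implementation: it assumes each branch-trained model exposes its feedforward block as an `nn.Module`, wraps those blocks as experts in a single MoE layer, and adds a freshly initialized linear router that is trained during the finetuning stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BTXMoELayer(nn.Module):
    """One MoE feedforward layer built from branch-trained expert FFNs (sketch)."""

    def __init__(self, expert_ffns: list[nn.Module], d_model: int, top_k: int = 2):
        super().__init__()
        # Each expert FFN is copied verbatim from one domain-trained branch.
        self.experts = nn.ModuleList(expert_ffns)
        # The router is newly initialized and learned during MoE finetuning.
        self.router = nn.Linear(d_model, len(expert_ffns), bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model), batch and sequence dims already flattened
        logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # Top-2 routing
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

In this reading, each transformer block's feedforward sublayer is replaced by such a layer, and the combined model is then finetuned so the router learns token-level expert selection.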
The router selects experts with a Top-2 approach, and a load-balancing loss is introduced to keep expert utilization balanced and prevent expert collapse (dead experts). Routing analysis shows that different expert combinations are activated depending on the task type.
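
One common way to implement such a load-balancing term is a Switch-Transformer-style auxiliary loss. The sketch below is an assumed formulation (the coefficient `alpha` and the helper name are illustrative, not taken from the paper): it penalizes the correlation between each expert's token load and its mean routing probability, which is minimized when both are uniform across experts.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(logits: torch.Tensor, top_k: int = 2, alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary loss that pushes the router toward uniform expert usage (sketch)."""
    n_experts = logits.size(-1)
    probs = F.softmax(logits, dim=-1)                         # (tokens, n_experts)
    # Fraction of top-k assignments that land on each expert.
    topk_idx = logits.topk(top_k, dim=-1).indices             # (tokens, top_k)
    assigned = F.one_hot(topk_idx, n_experts).float().sum(1)  # (tokens, n_experts)
    load = assigned.mean(dim=0) / top_k
    # Mean routing probability assigned to each expert.
    importance = probs.mean(dim=0)
    # Minimized when both load and importance are uniform across experts.
    return alpha * n_experts * torch.sum(load * importance)
```

This term would be added to the language-modeling loss during the router/MoE finetuning step so that no expert goes unused.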