BTX
BTX (Branch-Train-MiX) is a method for efficiently adding domain-specific expertise (e.g., math, code) to large language models while maintaining generality. Unlike BTM (Branch-Train-Merge), which loses capabilities through parameter averaging, BTX keeps physically separate experts within one model using a Mixture-of-Experts structure.
Previously, training separate specialized models left no straightforward way to combine them into a single model, while continually training one model caused catastrophic forgetting of existing knowledge.
- Branch-Train stage: Clone the base model (e.g., Llama-2 7B) multiple times and train each copy asynchronously, in parallel, on its own domain
- MiX stage: Merge the copies' feedforward layers into a Mixture-of-Experts (MoE) structure and train a router to select the appropriate experts for each input token (see the sketch after this list)
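A minimal sketch of the MiX stage, assuming each expert is a fine-tuned copy of the same base Transformer and that non-feedforward parameters (attention, embeddings, norms) are merged by simple averaging. The class and function names and the `ffn` naming convention are illustrative, not from the paper.

```python
import copy
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Token-level MoE layer built from the FFN modules of N domain experts."""
    def __init__(self, expert_ffns: list[nn.Module], d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(ffn) for ffn in expert_ffns)
        self.router = nn.Linear(d_model, len(self.experts), bias=False)  # trained during MiX
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> route each token to its top-k experts
        logits = self.router(x)                                # (B, S, num_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # top-2 selection
        weights = torch.softmax(weights, dim=-1)               # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                      # tokens sent to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

def average_non_ffn_params(expert_models: list[nn.Module]) -> dict:
    """Merge all non-FFN parameters of the domain experts by element-wise averaging."""
    merged = {}
    for name, _ in expert_models[0].named_parameters():
        if "ffn" not in name:  # assumed naming convention for FFN submodules
            merged[name] = torch.stack(
                [dict(m.named_parameters())[name] for m in expert_models]
            ).mean(dim=0)
    return merged
```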
The router selects experts with top-2 routing, and a load-balancing loss keeps expert utilization balanced, preventing expert collapse (dead experts). Routing analysis shows that different expert combinations are activated depending on the task type.
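A hedged sketch of the auxiliary load-balancing loss used alongside top-2 routing, following the common Switch-Transformer-style formulation; the coefficient `alpha` and the exact variant BTX uses are assumptions here.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2, alpha: float = 0.01) -> torch.Tensor:
    """Encourage uniform expert usage by penalizing the product of the fraction of
    tokens assigned to each expert and the mean routing probability for that expert."""
    num_experts = router_logits.size(-1)
    probs = torch.softmax(router_logits, dim=-1)                 # (tokens, E)
    _, top_idx = torch.topk(probs, top_k, dim=-1)                # top-2 expert indices per token
    # f_e: fraction of token-expert assignments that went to expert e
    assignments = F.one_hot(top_idx, num_experts).float().sum(dim=1)  # (tokens, E)
    f = assignments.mean(dim=0) / top_k
    # P_e: mean router probability mass placed on expert e
    p = probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * p)
```

The loss is minimized when both the assignment fractions and the mean routing probabilities are uniform across experts, which discourages any single expert from dominating or going dead.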

Seonglae Cho