BTX
BTX (Branch-Train-MiX) is a method for efficiently adding domain-specific expertise (e.g., math, code) to large language models while maintaining generality. Unlike BTM (Branch-Train-Merge), which loses capabilities through parameter averaging, BTX keeps physically separate experts within one model using a Mixture-of-Experts structure.
Previously, training separate specialized models left no straightforward way to combine them into a single model, while continually training one model caused catastrophic forgetting of existing knowledge.
- Branch-Train stage: Clone the base model (e.g., Llama-2 7B) multiple times and train each copy asynchronously, in parallel, on its own domain
- MiX stage: Merge the copies' feedforward layers into a Mixture-of-Experts (MoE) structure and train a router to select the appropriate experts for each input token (see the sketch after this list)
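A minimal sketch of the MiX stage, assuming each expert is a fine-tuned copy of the same base Transformer and that non-feedforward parameters (attention, embeddings, norms) are merged by simple averaging. The class and function names and the `ffn` naming convention are illustrative, not from the paper.

```python
import copy
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Token-level MoE layer built from the FFN modules of N domain experts."""
    def __init__(self, expert_ffns: list[nn.Module], d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(ffn) for ffn in expert_ffns)
        self.router = nn.Linear(d_model, len(self.experts), bias=False)  # trained during MiX
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> route each token to its top-k experts
        logits = self.router(x)                                # (B, S, num_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # top-2 selection
        weights = torch.softmax(weights, dim=-1)               # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                      # tokens sent to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

def average_non_ffn_params(expert_models: list[nn.Module]) -> dict:
    """Merge all non-FFN parameters of the domain experts by element-wise averaging."""
    merged = {}
    for name, _ in expert_models[0].named_parameters():
        if "ffn" not in name:  # assumed naming convention for FFN submodules
            merged[name] = torch.stack(
                [dict(m.named_parameters())[name] for m in expert_models]
            ).mean(dim=0)
    return merged
```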
The router selects experts with top-2 routing, and a load-balancing loss keeps expert utilization balanced, preventing expert collapse (dead experts). Routing analysis shows that different expert combinations are activated depending on the task type.
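A hedged sketch of the auxiliary load-balancing loss used alongside top-2 routing, following the common Switch-Transformer-style formulation; the coefficient `alpha` and the exact variant BTX uses are assumptions here.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2, alpha: float = 0.01) -> torch.Tensor:
    """Encourage uniform expert usage by penalizing the product of the fraction of
    tokens assigned to each expert and the mean routing probability for that expert."""
    num_experts = router_logits.size(-1)
    probs = torch.softmax(router_logits, dim=-1)                 # (tokens, E)
    _, top_idx = torch.topk(probs, top_k, dim=-1)                # top-2 expert indices per token
    # f_e: fraction of token-expert assignments that went to expert e
    assignments = F.one_hot(top_idx, num_experts).float().sum(dim=1)  # (tokens, E)
    f = assignments.mean(dim=0) / top_k
    # P_e: mean router probability mass placed on expert e
    p = probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * p)
```

The loss is minimized when both the assignment fractions and the mean routing probabilities are uniform across experts, which discourages any single expert from dominating or going dead.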

Seonglae Cho