Cluster data as a preprocessing step, then train LMs (one expert per cluster). A naive approach is parameter averaging to merge them, but this causes the knowledge from each expert to mix or cancel out, making it difficult to preserve domain-specific capabilities.

Branch-Train-Merge: Embarrassingly Parallel Training of Expert...
"We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train..."
https://arxiv.org/abs/2208.03306
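The cancellation problem with naive parameter averaging can be sketched in a few lines. This is a toy illustration, not the paper's code: each "expert" is a hypothetical dict of named weight arrays, and `average_experts` is an assumed helper name.

```python
import numpy as np

def average_experts(experts):
    """Merge expert parameter dicts by elementwise averaging.

    Uniform averaging mixes the experts' weights, so domain-specific
    parameters that point in opposing directions can cancel out.
    """
    keys = experts[0].keys()
    return {k: np.mean([e[k] for e in experts], axis=0) for k in keys}

# Two toy "experts" whose weights oppose each other:
expert_a = {"w": np.array([1.0, -1.0])}
expert_b = {"w": np.array([-1.0, 1.0])}

merged = average_experts([expert_a, expert_b])
print(merged["w"])  # the averaged weights cancel to zero
```

Here both experts' distinctive weights vanish in the merged model, which is the failure mode the note describes.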