OLMoE-1B-7B is a fully open Mixture-of-Experts (MoE) language model with 1.3B of its 6.9B total parameters activated per token. Each layer has 64 small experts, of which 8 are activated per token (fine-grained routing improves performance). The model was pretrained on 5.1T tokens, then adapted with SFT and DPO to create OLMoE-Instruct. Routing is dropless token-choice, trained with a load-balancing loss (weight 0.01) and a router z-loss (weight 0.001) for improved stability and quality (sketched below). There are no shared experts, and sparse upcycling was not used because it is inefficient for long training runs. Analysis shows high specialization among experts, rare co-activation, and routing that becomes fixed early in training.
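A minimal sketch of the routing setup described above: top-8-of-64 token-choice routing with a Switch-style load-balancing loss and a router z-loss. The expert count, top-k, and loss weights come from the note; everything else (function name, tensor shapes, exact loss formulation) is an illustrative assumption, not the OLMoE implementation.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, n_experts=64, top_k=8,
                 lb_coef=0.01, z_coef=0.001):
    """Hypothetical token-choice top-k router with the two auxiliary losses.
    hidden: [num_tokens, d_model], router_weight: [d_model, n_experts]."""
    logits = hidden @ router_weight                      # [T, E] router logits
    probs = F.softmax(logits, dim=-1)                    # routing probabilities
    top_p, top_idx = probs.topk(top_k, dim=-1)           # each token picks 8 of 64 experts

    # Load-balancing loss (Switch-Transformer style, assumed here): push the
    # fraction of tokens routed to each expert and the mean router probability
    # per expert toward uniform.
    dispatch = F.one_hot(top_idx, n_experts).sum(dim=1).float()  # [T, E] 0/1 assignments
    frac_tokens = dispatch.mean(dim=0)                   # fraction of tokens per expert
    frac_probs = probs.mean(dim=0)                       # mean router prob per expert
    lb_loss = n_experts * (frac_tokens * frac_probs).sum()

    # Router z-loss: penalize large router logits to keep routing numerically stable.
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

    aux_loss = lb_coef * lb_loss + z_coef * z_loss
    return top_idx, top_p, aux_loss
```

The auxiliary loss would be added to the language-modeling loss each step; dropless routing means no expert-capacity limit, so every token's top-8 assignments are kept rather than dropped.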

0924
0125

Seonglae Cho