OLMoE

Creator: Seonglae Cho
Created: 2025 Dec 31 11:34
Edited: 2026 Jan 14 12:19
OLMoE-1B-7B is a fully open Mixture-of-Experts (MoE) language model: about 1.3B of its 6.9B total parameters are activated per token. Each layer has 64 small experts, of which 8 are activated per token (fine-grained routing improves performance). It is pretrained on 5.1T tokens, followed by SFT/DPO to create OLMoE-INSTRUCT. Routing is dropless token-choice, trained with a load-balancing loss (weight 0.01) and a router z-loss (weight 0.001) for improved stability and quality. There are no shared experts and no sparse upcycling (inefficient for long training runs). Analysis results: experts are highly specialized, co-activation is rare, and routing becomes fixed early in training.
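As a rough illustration of the routing described above, here is a minimal sketch (not the OLMoE implementation) of dropless token-choice top-8 routing with a load-balancing loss and a router z-loss; the function and tensor names are assumptions for clarity.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts=64, top_k=8,
                 lb_coef=0.01, z_coef=0.001):
    """Token-choice top-k routing with the two auxiliary losses (illustrative sketch)."""
    # hidden: (num_tokens, d_model); router_weight: (num_experts, d_model)
    logits = hidden @ router_weight.t()                 # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top_probs, top_idx = probs.topk(top_k, dim=-1)      # every token keeps 8 experts (dropless)

    # Load-balancing loss: fraction of tokens per expert * mean router prob per expert
    dispatch = F.one_hot(top_idx, num_experts).float().sum(dim=1)  # (num_tokens, num_experts), 0/1
    frac_tokens = dispatch.mean(dim=0) / top_k          # sums to 1 over experts
    mean_probs = probs.mean(dim=0)
    lb_loss = num_experts * (frac_tokens * mean_probs).sum()

    # Router z-loss: penalize large router logits for numerical stability
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

    aux_loss = lb_coef * lb_loss + z_coef * z_loss
    return top_idx, top_probs, aux_loss

# Illustrative shapes only
tokens = torch.randn(16, 2048)   # 16 tokens, d_model = 2048 (assumed)
router = torch.randn(64, 2048)
idx, weights, aux = route_tokens(tokens, router)
```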
https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct
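A quick usage sketch with Hugging Face transformers (assuming a version that includes OLMoE support; the checkpoint name is taken from the link above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0125-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "Explain mixture-of-experts routing in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```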
Model versions: 0924, 0125
