OLMoE

Creator: Seonglae Cho
Created: 2025 Dec 31 11:34
Edited: 2026 Jan 14 12:19
OLMoE-1B-7B is a fully open Mixture-of-Experts (MoE) language model: about 1.3B of its 6.9B total parameters are activated per token. Each layer has 64 small experts, of which 8 are activated per token (fine-grained routing improves performance). It is pretrained on 5.1T tokens, followed by SFT/DPO to create OLMoE-INSTRUCT. Routing is dropless token-choice, trained with a load-balancing loss (weight 0.01) and a router z-loss (weight 0.001) for improved stability and quality. There are no shared experts and no sparse upcycling (inefficient for long training runs). Analysis results: experts are highly specialized, co-activation is rare, and routing becomes fixed early in training.
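As a rough illustration of the routing described above, here is a minimal sketch (not the OLMoE implementation) of dropless token-choice top-8 routing with a load-balancing loss and a router z-loss; the function and tensor names are assumptions for clarity.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts=64, top_k=8,
                 lb_coef=0.01, z_coef=0.001):
    """Token-choice top-k routing with the two auxiliary losses (illustrative sketch)."""
    # hidden: (num_tokens, d_model); router_weight: (num_experts, d_model)
    logits = hidden @ router_weight.t()                 # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top_probs, top_idx = probs.topk(top_k, dim=-1)      # every token keeps 8 experts (dropless)

    # Load-balancing loss: fraction of tokens per expert * mean router prob per expert
    dispatch = F.one_hot(top_idx, num_experts).float().sum(dim=1)  # (num_tokens, num_experts), 0/1
    frac_tokens = dispatch.mean(dim=0) / top_k          # sums to 1 over experts
    mean_probs = probs.mean(dim=0)
    lb_loss = num_experts * (frac_tokens * mean_probs).sum()

    # Router z-loss: penalize large router logits for numerical stability
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

    aux_loss = lb_coef * lb_loss + z_coef * z_loss
    return top_idx, top_probs, aux_loss

# Illustrative shapes only
tokens = torch.randn(16, 2048)   # 16 tokens, d_model = 2048 (assumed)
router = torch.randn(64, 2048)
idx, weights, aux = route_tokens(tokens, router)
```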
https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct
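A quick usage sketch with Hugging Face transformers (assuming a version that includes OLMoE support; the checkpoint name is taken from the link above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0125-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "Explain mixture-of-experts routing in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```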
Model versions: 0924, 0125
