AI Load Balancing

Usually MoE routing replaces the MLP (feedforward) sub-layer and is done per layer and per token (the attention weights are shared). Routing is based on an affinity score for each expert: the top-k experts are selected, and their outputs are combined as a weighted sum using those scores. A minimal code sketch of this routing, and of the auxiliary load-balance loss, appears at the end of this note.

Topics:
MoE Routing Notion
MoE Gating network
Sparse Gated MoE
Auxiliary-Loss-Free Load Balancing
Auxiliary Loss for Load Balance
Affinity Score
MoE Node-Limited Routing Load Balancing loss (Parallel Training)

Global-batch load balance: almost free lunch to improve your MoE LLM training
Background: The Mixture-of-Experts (MoE) architecture has become a popular model-parameter-scale-up technique. Typically, one MoE layer consists of a router (often parameterized as a single Linear layer) and a group of experts (for transformer-based models, each expert is one feedforward layer). Given an input, only a subset of experts is activated, and their outputs are then aggregated based on the scores the router assigned.
https://qwenlm.github.io/blog/global-load-balance/

Mixture-of-Experts with Expert Choice Routing
https://ai.googleblog.com/2022/11/mixture-of-experts-with-expert-choice.html

Octopus
NexaAIDev/Octopus-v4 · Hugging Face
https://huggingface.co/NexaAIDev/Octopus-v4

DeepSeek
DeepSeek-V2 (deepseek-ai)
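
A minimal sketch of the top-k routing described at the top of this note, assuming PyTorch: a single Linear router produces an affinity score per expert, the top-k experts are selected per token, and their feedforward outputs are combined as a score-weighted sum. The hidden sizes, expert count, and k below are illustrative placeholders, not values from any of the linked models.

```python
# Sketch of one top-k routed MoE layer (assumed hyperparameters, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: a single Linear layer producing one affinity score per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: each one is an ordinary feedforward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # affinity scores
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        # Renormalize the selected scores so they sum to 1 per token.
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Weighted sum of the selected experts' outputs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot:slot + 1] * expert(x[mask])
        return out

# Example: 16 token embeddings of width 512 go in, same shape comes out.
layer = TopKMoELayer()
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```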
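
For the "Auxiliary Loss for Load Balance" topic above, a hedged sketch of the common Switch-Transformer-style formulation: the sum over experts of the fraction of dispatched tokens times the mean router probability, scaled by the number of experts. The coefficient alpha and the top-k normalization are assumptions for illustration, not taken from any of the linked models; inside the layer above, `router_probs` and `expert_idx` would correspond to `scores` and `topk_idx`.

```python
# Sketch of an auxiliary load-balancing loss (Switch Transformer-style form).
import torch

def load_balance_aux_loss(router_probs, expert_idx, alpha=0.01):
    """
    router_probs: (num_tokens, num_experts) softmax affinity scores from the router.
    expert_idx:   (num_tokens, top_k) indices of the experts each token was dispatched to.
    """
    num_tokens, num_experts = router_probs.shape
    top_k = expert_idx.shape[1]
    # f_i: fraction of token-to-expert assignments that went to expert i.
    one_hot = torch.zeros_like(router_probs).scatter_(1, expert_idx, 1.0)
    f = one_hot.sum(dim=0) / (num_tokens * top_k)
    # P_i: mean router probability assigned to expert i.
    p = router_probs.mean(dim=0)
    # Both f and P are uniform (1 / num_experts) when load is perfectly balanced,
    # so this scaled dot product is minimized by a balanced assignment.
    return alpha * num_experts * torch.sum(f * p)
```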