MoEs replace dense feed-forward network (FFN) layers with sparse MoE layers, each consisting of a certain number of "experts", where each expert is itself a neural network. An expert can even be another MoE layer, yielding a recursive hierarchy of MoEs. Because only a few experts are activated per token, this setup enables efficient pre-training and faster inference than a dense model with the same number of parameters.
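To make the layer structure concrete, here is a minimal PyTorch sketch of a sparse MoE layer with top-k routing. The class names, hyperparameters (8 experts, top-2 routing), and the choice to softmax only over the selected experts' router logits are illustrative assumptions, not the implementation of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A single expert: an ordinary feed-forward network (FFN)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SparseMoELayer(nn.Module):
    """Drop-in replacement for a dense FFN: a router sends each token
    to its top-k experts and mixes their outputs by the router weights."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.top_k = top_k

    def forward(self, x):                               # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])             # (num_tokens, d_model)
        logits = self.router(tokens)                    # (num_tokens, num_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):       # loop form for clarity, not speed
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                # this expert received no tokens
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)

# Example: route a batch of token embeddings through the sparse layer.
layer = SparseMoELayer(d_model=64, d_hidden=256, num_experts=8, top_k=2)
y = layer(torch.randn(4, 16, 64))                       # output shape matches input: (4, 16, 64)
```

The per-expert Python loop is written for readability; efficient implementations instead batch tokens per expert or use fused kernels, but the routing logic is the same.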
In particular, MoEs make pretraining more compute-efficient than dense models, allowing the model or dataset size to be scaled up within the same compute budget.