MoE (Mixture of Experts) supports a recursive hierarchy: an expert can itself be an MoE, yielding a hierarchical MoE.
MoEs replace dense feed-forward network layers with sparse MoE layers, each consisting of a set of "experts" (each expert being a neural network, typically a feed-forward network) plus a router that selects which experts process each input. This setup enables efficient pre-training and faster inference compared to dense models.
MoEs enable more compute-efficient pre-training than dense models, allowing the model or dataset size to be scaled up within the same compute budget.
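A minimal sketch of such a sparse MoE layer in PyTorch (illustrative, not taken from any specific library): a learned router scores the experts for every token, only the top-k experts run on that token, and their outputs are combined with the renormalized router weights. The class and parameter names here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Sparse MoE layer: route each token to its top-k experts."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # The router (gate) scores every expert for every token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))            # (num_tokens, d_model)
        logits = self.gate(tokens)                    # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize over the selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                              # this expert received no tokens
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)
```

Because only top_k of the experts run per token, the compute per token stays close to a single dense FFN while the parameter count grows with the number of experts, which is what makes the pre-training compute-efficient.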
Sparsely Gated MoE
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation.
https://arxiv.org/abs/1701.06538
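A rough sketch of the noisy top-k gating this paper describes, as I read it: tunable Gaussian noise is added to the gate logits during training, everything outside the top k is masked out, and a softmax yields sparse gating weights. The function name and the `w_gate` / `w_noise` weight matrices are placeholders for illustration, not the paper's reference code.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k: int, train: bool = True):
    # Clean gate logits: one score per expert for each token.
    clean_logits = x @ w_gate                          # (num_tokens, num_experts)
    if train:
        # Learned, per-expert noise scale encourages load balancing during training.
        noise_std = F.softplus(x @ w_noise)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits
    # Keep only the top-k logits; the rest are set to -inf so softmax gives them weight 0.
    top_vals, top_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf")).scatter(-1, top_idx, top_vals)
    return F.softmax(masked, dim=-1)                   # sparse gating weights
```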


Seonglae Cho