MoEs replace dense feed-forward network (FFN) layers with sparse MoE layers, each consisting of a certain number of "experts", where each expert is itself a neural network. An expert can even be another MoE layer, yielding a recursive hierarchy of MoEs. Because only a few experts are activated per token, this setup enables efficient pre-training and faster inference than a dense model with the same number of parameters.
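To make the layer structure concrete, here is a minimal PyTorch sketch of a sparse MoE layer with top-k routing. The class names, hyperparameters (8 experts, top-2 routing), and the choice to softmax only over the selected experts' router logits are illustrative assumptions, not the implementation of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A single expert: an ordinary feed-forward network (FFN)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SparseMoELayer(nn.Module):
    """Drop-in replacement for a dense FFN: a router sends each token
    to its top-k experts and mixes their outputs by the router weights."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.top_k = top_k

    def forward(self, x):                               # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])             # (num_tokens, d_model)
        logits = self.router(tokens)                    # (num_tokens, num_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):       # loop form for clarity, not speed
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                # this expert received no tokens
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)

# Example: route a batch of token embeddings through the sparse layer.
layer = SparseMoELayer(d_model=64, d_hidden=256, num_experts=8, top_k=2)
y = layer(torch.randn(4, 16, 64))                       # output shape matches input: (4, 16, 64)
```

The per-expert Python loop is written for readability; efficient implementations instead batch tokens per expert or use fused kernels, but the routing logic is the same.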
In particular, MoEs make pretraining more compute-efficient than dense models, allowing the model or dataset size to be scaled up within the same compute budget.