Problem: non-deterministic behavior
Backward memory overhead is huge
TL;DR: up to 11× faster MoE inference in Transformers 🤯

Excited to share that my PR on optimizing the experts implementation for MoEs just got merged into 🤗 Hugging Face Transformers. We added a new API for MoEs, with two new optimized paths to choose from: *grouped_mm* and *batched_mm*. These paths achieve substantial inference speedups compared to the previous eager implementation and behave much better under PyTorch compilation. In practice, the new backends can typically provide ~6–11× speedups when used appropriately:

- *grouped_mm* is the best all-rounder: it's overall the fastest option and scales very well for large input sizes, which makes it the default choice in most scenarios. However, it comes with some requirements, such as PyTorch 2.9 and bfloat16 when compiled.
- *batched_mm* was originally meant for torch.export compatibility, but it demonstrated an incredible edge for small input sizes. It works especially well when combined with torch.compile, as it benefits the most from maximum auto-tuning, making it a great option for decoding workloads.

The attached plot summarizes this; full benchmarks and reproduction scripts can be found in the PR and documentation (links in the first comment).

Why this matters:
- MoEs are becoming the default in recent SOTA releases, yet training and deploying them still often suffer from GPU under-utilization.
- Being compile- and export-friendly makes these paths easier to integrate into real production pipelines (vLLM, ONNX, OpenVINO).

Are you using 🤗 Transformers with MoEs? Try it out by installing from source, or wait for the next pre-release! Big thanks to Arthur Zucker and Steven Liu for their reviews and great feedback on polishing the API and its documentation (links in the first comment).
https://www.linkedin.com/posts/ilyas-moutawwakil_tldr-up-to-11-faster-moe-inference-in-activity-7413936534367653888-NiiK/
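A minimal sketch of how one might try the new paths from Python. Only the backend names (*grouped_mm*, *batched_mm*) come from the post; the `experts_implementation` keyword and the Qwen checkpoint are assumptions for illustration, so check the "Experts backends" docs linked below for the exact argument name.

```python
# Hedged sketch: the "experts_implementation" kwarg name and the checkpoint are
# assumptions for illustration; only the backend names ("grouped_mm",
# "batched_mm") come from the post. See the docs link below for the exact API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B"  # placeholder MoE checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,                 # grouped_mm expects bfloat16 when compiled
    experts_implementation="grouped_mm",  # assumed kwarg; "batched_mm" favors small inputs / decoding
    device_map="auto",
)

# Per the post, batched_mm benefits most from maximum auto-tuning under torch.compile.
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Mixture-of-experts models route each token to", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```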
Experts backends
https://huggingface.co/docs/transformers/main/en/experts_interface
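For intuition about what these backends buy you, here is a self-contained toy sketch (not the Transformers implementation) contrasting the eager per-expert loop with a single batched matmul over the expert dimension. The uniform tokens-per-expert split is a simplifying assumption; real routing assigns a variable number of tokens to each expert.

```python
# Toy illustration, not the library's code: eager per-expert loop vs. one
# batched matmul. Uniform tokens-per-expert is a simplifying assumption.
import torch

num_experts, tokens_per_expert, hidden, ffn = 8, 16, 1024, 4096
x = torch.randn(num_experts, tokens_per_expert, hidden)  # tokens already grouped by expert
w = torch.randn(num_experts, hidden, ffn)                # one up-projection per expert

# Eager-style path: many small matmuls, one kernel launch per expert.
eager_out = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# Batched path: a single bmm over the expert dimension; far fewer launches
# and a much friendlier target for torch.compile auto-tuning.
batched_out = torch.bmm(x, w)

print("max abs diff:", (eager_out - batched_out).abs().max().item())  # ~float rounding noise
```

Grouped-GEMM-style kernels go one step further by handling ragged per-expert group sizes in a single call, which is presumably why the post reports grouped_mm scaling best for large inputs.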

Seonglae Cho