Problem: non-deterministic behavior
Backward memory overhead is huge
TL;DR: up to 11× faster MoE inference in Transformers 🤯

Excited to share that my PR on optimizing the experts implementation for MoEs just got merged into 🤗 Hugging Face Transformers. We added a new API for MoEs, with two new optimized paths to choose from: *grouped_mm* and *batched_mm*. These paths achieve substantial inference speedups compared to the previous eager implementation and behave much better under PyTorch compilation. In practice, the new backends can typically provide ~6–11× speedups when used appropriately:

- *grouped_mm* is the best all-rounder: it's overall the fastest option and scales very well for large input sizes, which makes it the default choice in most scenarios. However, it comes with some requirements, such as PyTorch 2.9 and bfloat16 when compiled.
- *batched_mm* was originally meant for torch.export compatibility, but it demonstrated an incredible edge for small input sizes. It works especially well when combined with torch.compile, as it benefits the most from maximum auto-tuning, making it a great option for decoding workloads.

The attached plot summarizes this; full benchmarks and reproduction scripts can be found in the PR and documentation (links in the first comment).

Why this matters:
- MoEs are becoming the default in recent SOTA releases, yet training and deploying them still often suffer from GPU under-utilization.
- Being compile- and export-friendly makes these paths easier to integrate into real production pipelines (vLLM, ONNX, OpenVINO).

Are you using 🤗 Transformers with MoEs? Try it out by installing from source, or wait for the next pre-release! Big thanks to Arthur Zucker and Steven Liu for their reviews and great feedback on polishing the API and its documentation (links in the first comment).
https://www.linkedin.com/posts/ilyas-moutawwakil_tldr-up-to-11-faster-moe-inference-in-activity-7413936534367653888-NiiK/
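A minimal sketch of how one might try the new paths from Python. Only the backend names (*grouped_mm*, *batched_mm*) come from the post; the `experts_implementation` keyword and the Qwen checkpoint are assumptions for illustration, so check the "Experts backends" docs linked below for the exact argument name.

```python
# Hedged sketch: the "experts_implementation" kwarg name and the checkpoint are
# assumptions for illustration; only the backend names ("grouped_mm",
# "batched_mm") come from the post. See the docs link below for the exact API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B"  # placeholder MoE checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,                 # grouped_mm expects bfloat16 when compiled
    experts_implementation="grouped_mm",  # assumed kwarg; "batched_mm" favors small inputs / decoding
    device_map="auto",
)

# Per the post, batched_mm benefits most from maximum auto-tuning under torch.compile.
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Mixture-of-experts models route each token to", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```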
Experts backends
https://huggingface.co/docs/transformers/main/en/experts_interface
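For intuition about what these backends buy you, here is a self-contained toy sketch (not the Transformers implementation) contrasting the eager per-expert loop with a single batched matmul over the expert dimension. The uniform tokens-per-expert split is a simplifying assumption; real routing assigns a variable number of tokens to each expert.

```python
# Toy illustration, not the library's code: eager per-expert loop vs. one
# batched matmul. Uniform tokens-per-expert is a simplifying assumption.
import torch

num_experts, tokens_per_expert, hidden, ffn = 8, 16, 1024, 4096
x = torch.randn(num_experts, tokens_per_expert, hidden)  # tokens already grouped by expert
w = torch.randn(num_experts, hidden, ffn)                # one up-projection per expert

# Eager-style path: many small matmuls, one kernel launch per expert.
eager_out = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# Batched path: a single bmm over the expert dimension; far fewer launches
# and a much friendlier target for torch.compile auto-tuning.
batched_out = torch.bmm(x, w)

print("max abs diff:", (eager_out - batched_out).abs().max().item())  # ~float rounding noise
```

Grouped-GEMM-style kernels go one step further by handling ragged per-expert group sizes in a single call, which is presumably why the post reports grouped_mm scaling best for large inputs.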

Seonglae Cho