Rearrange the blocks into one large "block grid", skipping empty blocks (tracked via metadata) and multiplying only the blocks that are present.
- Represent MoE's "token groups per expert" as a block-diagonal structure within a single block-sparse matrix (see the first sketch after this list).
- Implement new fast SDD/DSD/DDS kernels (the three letters name the output and the two inputs, S = sparse and D = dense; e.g. SDD computes a sparse output from two dense inputs; second sketch below).
- Store sparse metadata in a hybrid Blocked-CSR-COO format (BCSR plus an explicit COO-style row index per block) to eliminate searches like "which row is this block in?" during SDD (third sketch below).
- For transposed access, instead of copying the actual values into a transposed layout, build transpose indices (a secondary index) that iterate the same blocks in transposed order (also in the third sketch).
- Fuse zero-padding into the token permutation stage so that each expert's token count becomes a multiple of the block size (the padding is the minimum needed for block alignment, rather than dropping tokens; see the first sketch).
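
A minimal sketch of the block-grid layout, assuming illustrative names (`pad_to_block`, `block_diagonal_topology`) and sizes; this is not MegaBlocks' actual code. Padding each expert's token count to a block multiple aligns every row group to whole blocks, and empty experts simply contribute no blocks:

```python
import torch

def pad_to_block(tokens_per_expert, block=128):
    # Round each expert's token count up to the nearest block multiple
    # (minimal padding for block alignment; no tokens are dropped).
    return ((tokens_per_expert + block - 1) // block) * block

def block_diagonal_topology(tokens_per_expert, ffn_hidden, block=128):
    # Expert e owns a contiguous band of column-blocks [e*c, (e+1)*c);
    # only its own row-blocks intersect that band, giving a block-diagonal grid.
    row_blocks = pad_to_block(tokens_per_expert, block) // block
    col_blocks = ffn_hidden // block
    rows, cols = [], []
    row_offset = 0
    for e, r in enumerate(row_blocks.tolist()):
        for i in range(r):
            for j in range(col_blocks):
                rows.append(row_offset + i)
                cols.append(e * col_blocks + j)
        row_offset += r
    return torch.tensor(rows), torch.tensor(cols)

tokens_per_expert = torch.tensor([300, 0, 513])   # expert 1 receives no tokens
rows, cols = block_diagonal_topology(tokens_per_expert, ffn_hidden=512)
# Expert 1 contributes no entries: only present blocks are stored and multiplied.
```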
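A reference-semantics sketch of the three kernel patterns, with dense tensors and an elementwise mask standing in for real block-sparse storage; the function names are assumptions for illustration:

```python
import torch

def expand_block_mask(block_mask, block):
    # Expand an (R, C) block-presence mask to an (R*block, C*block) element mask.
    return block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)

def sdd(a, b, block_mask, block):
    # SDD: dense @ dense -> sparse output. A real kernel never computes absent
    # blocks; here we zero them after a dense matmul just to show the semantics.
    return (a @ b) * expand_block_mask(block_mask, block).to(a.dtype)

def dsd(a_sparse, b):
    # DSD: sparse @ dense -> dense output (a_sparse stands in for sparse storage).
    return a_sparse @ b

def dds(a, b_sparse):
    # DDS: dense @ sparse -> dense output.
    return a @ b_sparse
```

Roughly, the forward pass uses SDD for the first expert layer (its output lives on the block grid) and DSD for the second; transposed variants of these patterns cover the backward pass.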
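A sketch of the hybrid metadata and the transpose index under the same assumptions; `build_metadata` and its return layout are illustrative, not MegaBlocks' actual API:

```python
import torch

def build_metadata(rows, cols, n_row_blocks, n_col_blocks):
    # Store blocks in row-major order (the order of the block value storage).
    order = torch.argsort(rows * n_col_blocks + cols)
    rows, cols = rows[order], cols[order]
    # BCSR part: offsets[r]..offsets[r+1] index the blocks of block-row r.
    offsets = torch.zeros(n_row_blocks + 1, dtype=torch.long)
    offsets.scatter_add_(0, rows + 1, torch.ones_like(rows))
    offsets = offsets.cumsum(0)
    # BCOO part: an explicit row index per block, so a kernel can map a block
    # straight to its row in O(1) instead of searching the offsets (the SDD case).
    row_indices = rows.clone()
    # Transpose indices: a permutation visiting the same blocks column-major,
    # i.e. iterating A^T without copying any block values.
    transpose_order = torch.argsort(cols * n_row_blocks + rows)
    return offsets, row_indices, cols, transpose_order

# Reusing rows/cols from the first sketch:
# offsets, row_idx, col_idx, t_order = build_metadata(
#     rows, cols, int(rows.max()) + 1, int(cols.max()) + 1)
```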
Compared with the batched-matmul and grouped-matmul baselines, this approach is particularly strong in memory use on the training backward pass.
