Rearrange the blocks into one large "block grid", skipping empty blocks (tracked via metadata) and multiplying only the blocks that are present.
- Represent MoE's "token groups per expert" as a block-diagonal structure within a single block-sparse matrix (see the first sketch after this list).
- Implement new fast SDD/DSD/DDS kernels (the three letters name the output and the two inputs, S = sparse and D = dense; e.g. SDD computes a sparse output from two dense inputs; second sketch below).
- Store sparse metadata in a hybrid Blocked-CSR-COO format (BCSR plus an explicit COO-style row index per block) to eliminate searches like "which row is this block in?" during SDD (third sketch below).
- For transposed access, instead of copying the actual values into a transposed layout, build transpose indices (a secondary index) that iterate the same blocks in transposed order (also in the third sketch).
- Fuse zero-padding into the token permutation stage so that each expert's token count becomes a multiple of the block size (the padding is the minimum needed for block alignment, rather than dropping tokens; see the first sketch).
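
A minimal sketch of the block-grid layout, assuming illustrative names (`pad_to_block`, `block_diagonal_topology`) and sizes; this is not MegaBlocks' actual code. Padding each expert's token count to a block multiple aligns every row group to whole blocks, and empty experts simply contribute no blocks:

```python
import torch

def pad_to_block(tokens_per_expert, block=128):
    # Round each expert's token count up to the nearest block multiple
    # (minimal padding for block alignment; no tokens are dropped).
    return ((tokens_per_expert + block - 1) // block) * block

def block_diagonal_topology(tokens_per_expert, ffn_hidden, block=128):
    # Expert e owns a contiguous band of column-blocks [e*c, (e+1)*c);
    # only its own row-blocks intersect that band, giving a block-diagonal grid.
    row_blocks = pad_to_block(tokens_per_expert, block) // block
    col_blocks = ffn_hidden // block
    rows, cols = [], []
    row_offset = 0
    for e, r in enumerate(row_blocks.tolist()):
        for i in range(r):
            for j in range(col_blocks):
                rows.append(row_offset + i)
                cols.append(e * col_blocks + j)
        row_offset += r
    return torch.tensor(rows), torch.tensor(cols)

tokens_per_expert = torch.tensor([300, 0, 513])   # expert 1 receives no tokens
rows, cols = block_diagonal_topology(tokens_per_expert, ffn_hidden=512)
# Expert 1 contributes no entries: only present blocks are stored and multiplied.
```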
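A reference-semantics sketch of the three kernel patterns, with dense tensors and an elementwise mask standing in for real block-sparse storage; the function names are assumptions for illustration:

```python
import torch

def expand_block_mask(block_mask, block):
    # Expand an (R, C) block-presence mask to an (R*block, C*block) element mask.
    return block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)

def sdd(a, b, block_mask, block):
    # SDD: dense @ dense -> sparse output. A real kernel never computes absent
    # blocks; here we zero them after a dense matmul just to show the semantics.
    return (a @ b) * expand_block_mask(block_mask, block).to(a.dtype)

def dsd(a_sparse, b):
    # DSD: sparse @ dense -> dense output (a_sparse stands in for sparse storage).
    return a_sparse @ b

def dds(a, b_sparse):
    # DDS: dense @ sparse -> dense output.
    return a @ b_sparse
```

Roughly, the forward pass uses SDD for the first expert layer (its output lives on the block grid) and DSD for the second; transposed variants of these patterns cover the backward pass.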
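A sketch of the hybrid metadata and the transpose index under the same assumptions; `build_metadata` and its return layout are illustrative, not MegaBlocks' actual API:

```python
import torch

def build_metadata(rows, cols, n_row_blocks, n_col_blocks):
    # Store blocks in row-major order (the order of the block value storage).
    order = torch.argsort(rows * n_col_blocks + cols)
    rows, cols = rows[order], cols[order]
    # BCSR part: offsets[r]..offsets[r+1] index the blocks of block-row r.
    offsets = torch.zeros(n_row_blocks + 1, dtype=torch.long)
    offsets.scatter_add_(0, rows + 1, torch.ones_like(rows))
    offsets = offsets.cumsum(0)
    # BCOO part: an explicit row index per block, so a kernel can map a block
    # straight to its row in O(1) instead of searching the offsets (the SDD case).
    row_indices = rows.clone()
    # Transpose indices: a permutation visiting the same blocks column-major,
    # i.e. iterating A^T without copying any block values.
    transpose_order = torch.argsort(cols * n_row_blocks + rows)
    return offsets, row_indices, cols, transpose_order

# Reusing rows/cols from the first sketch:
# offsets, row_idx, col_idx, t_order = build_metadata(
#     rows, cols, int(rows.max()) + 1, int(cols.max()) + 1)
```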
Compared with the batched-matmul and grouped-matmul baselines, this approach is particularly strong in memory use on the training backward pass.
