Weight-sparse Transformers


L0-sparse models

Cross-entropy loss is slightly higher than for dense models of the same scale, but circuit interpretability improves by 10-16x (see Circuit Discovery). Training is at least 100x, and up to 1000x, more expensive than for comparable dense models, since NVIDIA Tensor Cores are optimized exclusively for dense GEMM operations and unstructured weight sparsity therefore yields no hardware speedup.
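A minimal sketch of how the L0 constraint can be enforced, assuming a projected-gradient scheme that keeps only the top-k magnitude entries of each weight matrix after every optimizer step (the function name, sparsity level, and loss here are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def project_topk_(weight: torch.Tensor, density: float) -> None:
    """Keep only the largest-magnitude entries of `weight`, zeroing the rest.

    `density` is the fraction of nonzero weights to retain
    (e.g. 0.01 keeps 1% of entries). Operates in place.
    """
    k = max(1, int(density * weight.numel()))
    # Threshold at the k-th largest absolute value.
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    weight.mul_((weight.abs() >= threshold).to(weight.dtype))

# Illustrative training step: take a dense gradient step, then re-project
# onto the sparse support so the L0 constraint holds throughout training.
model = nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 512)
loss = model(x).pow(2).mean()  # stand-in loss for the sketch
loss.backward()
optimizer.step()
project_topk_(model.weight, density=0.01)
```

Note that the zeroed entries are still stored and multiplied as a dense matrix, so the forward pass costs the same GEMM FLOPs as the dense model; sparsity buys interpretability, not speed, on current Tensor Core hardware.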
