L0-sparse models
Cross-entropy loss is slightly higher than dense models of the same scale, but circuit interpretability improves by 10-16x. Circuit Discovery. Training cost is at least 100x, and up to 1000x more expensive compared to dense models. NVIDIA Tensor Core is optimized exclusively for dense GEMM operations

Understanding neural networks through sparse circuits
We trained models to think in simpler, more traceable steps—so we can better understand how they work.
https://openai.com/index/understanding-neural-networks-through-sparse-circuits/


Seonglae Cho