Weight-sparse Transformers

Creator
Seonglae Cho
Created
2025 Nov 26 13:32
Edited
2025 Nov 26 13:34
Refs

L0-sparse models

Cross-entropy loss is slightly higher than that of dense models of the same scale, but circuit interpretability improves by roughly 10-16x.
Circuit Discovery
Training cost is at least 100x, and up to 1000x, higher than that of comparable dense models, because NVIDIA Tensor Cores are optimized for dense GEMM operations and gain little from unstructured weight sparsity.
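A minimal sketch of how a fixed weight-L0 budget can be enforced, assuming magnitude-based top-k projection applied to each weight matrix after an optimizer step; the function name and values are illustrative, not from the source:

```python
# Hypothetical sketch of a weight-L0 constraint: keep only the k
# largest-magnitude weights and zero out the rest. In practice this
# would be applied per weight matrix after every optimizer step.

def topk_magnitude_project(weights, k):
    """Keep the k largest-magnitude entries of a flat weight list; zero the rest."""
    if k >= len(weights):
        return list(weights)
    # indices of the k entries with largest |w|
    keep = set(
        sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)[:k]
    )
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

w = [0.9, -0.05, 0.3, -1.2, 0.01, 0.4]
sparse_w = topk_magnitude_project(w, k=3)
print(sparse_w)  # → [0.9, 0.0, 0.0, -1.2, 0.0, 0.4]
```

The surviving weights stay dense in memory, which is one way to see why such unstructured sparsity brings no speedup on hardware built for dense GEMM.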
arxiv.org
Understanding neural networks through sparse circuits
We trained models to think in simpler, more traceable steps—so we can better understand how they work.