Weight-sparse Transformers


L0-sparse models

Cross-entropy loss is slightly higher than for dense models of the same scale, but circuit interpretability improves by 10-16x (see Circuit Discovery). Training is at least 100x, and up to 1000x, more expensive than for comparable dense models, since NVIDIA Tensor Cores are optimized exclusively for dense GEMM operations and unstructured weight sparsity therefore yields no hardware speedup.
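A minimal sketch of how the L0 constraint can be enforced, assuming a projected-gradient scheme that keeps only the top-k magnitude entries of each weight matrix after every optimizer step (the function name, sparsity level, and loss here are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def project_topk_(weight: torch.Tensor, density: float) -> None:
    """Keep only the largest-magnitude entries of `weight`, zeroing the rest.

    `density` is the fraction of nonzero weights to retain
    (e.g. 0.01 keeps 1% of entries). Operates in place.
    """
    k = max(1, int(density * weight.numel()))
    # Threshold at the k-th largest absolute value.
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    weight.mul_((weight.abs() >= threshold).to(weight.dtype))

# Illustrative training step: take a dense gradient step, then re-project
# onto the sparse support so the L0 constraint holds throughout training.
model = nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 512)
loss = model(x).pow(2).mean()  # stand-in loss for the sketch
loss.backward()
optimizer.step()
project_topk_(model.weight, density=0.01)
```

Note that the zeroed entries are still stored and multiplied as a dense matrix, so the forward pass costs the same GEMM FLOPs as the dense model; sparsity buys interpretability, not speed, on current Tensor Core hardware.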
