L0-sparse models
Cross-entropy loss is slightly higher than dense models of the same scale, but circuit interpretability improves by 10-16x. Circuit Discovery. Training cost is at least 100x, and up to 1000x more expensive compared to dense models. NVIDIA Tensor Core is optimized exclusively for dense GEMM operations


Seonglae Cho