The decomposition should use as few components as possible to replicate the network’s behavior on its training distribution
Interpretability in Parameter Space: Minimizing Mechanistic...
Mechanistic interpretability aims to understand the internal mechanisms learned by neural networks. Despite recent progress toward this goal, it remains unclear how best to decompose neural...
https://publications.apolloresearch.ai/apd


Seonglae Cho