Faithfulness Interpretability

Creator

Seonglae Cho

Created

2025 Jan 18 0:49

Editor

Seonglae Cho

Edited

2025 Apr 6 17:57

Refs

The decomposition should identify a set of components that sum to the parameters of the original network.
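The sum-to-parameters criterion can be checked numerically. The following is a minimal sketch, not any paper's implementation: it assumes the parameters are a single weight matrix stored as nested lists, and the names `sum_components`, `faithfulness_error`, `is_faithful`, and the tolerance `tol` are all hypothetical, introduced only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def sum_components(components: List[Matrix]) -> Matrix:
    """Add the component matrices entry by entry (all must share one shape)."""
    rows, cols = len(components[0]), len(components[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for comp in components:
        for i in range(rows):
            for j in range(cols):
                total[i][j] += comp[i][j]
    return total


def faithfulness_error(original: Matrix, components: List[Matrix]) -> float:
    """Largest absolute entrywise gap between the original and the component sum."""
    total = sum_components(components)
    return max(
        abs(o - t)
        for row_o, row_t in zip(original, total)
        for o, t in zip(row_o, row_t)
    )


def is_faithful(original: Matrix, components: List[Matrix], tol: float = 1e-6) -> bool:
    """True when the components reconstruct the original within tol (assumed threshold)."""
    return faithfulness_error(original, components) <= tol


# Toy example: two components that together reconstruct a 2x2 weight matrix.
original = [[1.0, 2.0], [3.0, 4.0]]
components = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 2.0], [3.0, 4.0]],
]
print(is_faithful(original, components))      # full set of components
print(is_faithful(original, components[:1]))  # one component dropped
```

Dropping a component leaves a nonzero reconstruction gap, which is the sense in which an incomplete decomposition fails this criterion.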

https://aclanthology.org/2022.acl-long.345.pdf

Comprehensiveness & Plausibility

https://aclanthology.org/2020.acl-main.408.pdf

Circuit Discovery

Transformer Circuit Evaluation Metrics Are Not Robust

Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' - subgraphs of the full...

https://openreview.net/forum?id=zSf8PJyQb2#discussion
