All 18 metrics in NeuronEval evaluate the "faithfulness" of an explanation by measuring how well the activations it predicts match the actual activations of a unit (neuron or SAE feature). The paper introduces two sanity checks for these metrics: the "missing label test" and the "excessive label test". Of the 18 stand-alone metrics, only five pass both sanity checks and are recommended as "reliable": F1-score, IoU, Pearson correlation, cosine similarity, and AUPRC.
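Below is a minimal sketch (not the paper's reference implementation) of how the five reliable metrics could be computed between an explanation's predicted activations and a unit's actual activations. The function name, variable names, and the simple thresholding used to binarize activations are illustrative assumptions; a sanity check in the spirit of the missing/excessive label tests would perturb `predicted` and verify that the scores drop.

```python
# Hedged sketch: score an explanation's predicted activations against a unit's
# actual activations using the five metrics that passed both sanity checks.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score, average_precision_score

def reliable_metrics(actual, predicted, threshold=0.0):
    """Illustrative (assumed) scoring helper, not the paper's exact code.

    actual    : real-valued activations of one neuron / SAE feature, shape (n_inputs,)
    predicted : explanation-derived activation predictions, shape (n_inputs,)
    threshold : cutoff used to binarize activations for the set-based metrics
    """
    actual_bin = (actual > threshold).astype(int)
    predicted_bin = (predicted > threshold).astype(int)

    # F1-score and IoU compare the binarized activation masks.
    f1 = f1_score(actual_bin, predicted_bin)
    intersection = np.logical_and(actual_bin, predicted_bin).sum()
    union = np.logical_or(actual_bin, predicted_bin).sum()
    iou = intersection / union if union > 0 else 0.0

    # Pearson correlation and cosine similarity compare the raw values.
    pearson, _ = pearsonr(actual, predicted)
    cosine = np.dot(actual, predicted) / (
        np.linalg.norm(actual) * np.linalg.norm(predicted)
    )

    # AUPRC treats binarized actual activations as labels and the
    # predicted activations as ranking scores.
    auprc = average_precision_score(actual_bin, predicted)

    return {"F1": f1, "IoU": iou, "Pearson": pearson,
            "Cosine": cosine, "AUPRC": auprc}
```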


Datasets used to evaluate which of the 18 metrics work well:
- Vision
  - ImageNet (1,000 classes)
  - Places365 (365 place categories)
  - CUB-200-2011 (200 bird species; 112 detailed attribute labels)
- Language
  - OpenWebText (for evaluating GPT-2; limited to 500 frequently occurring tokens)