All 18 metrics in NeuronEval evaluate the "faithfulness" of an explanation by measuring how well the activations it predicts match the actual activations of a unit (neuron or SAE feature). The paper introduces two sanity checks for these metrics: the "missing label test" and the "excessive label test". Of the 18 stand-alone metrics, only five pass both sanity checks and are recommended as "reliable": F1-score, IoU, Pearson correlation, cosine similarity, and AUPRC.
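Below is a minimal sketch (not the paper's reference implementation) of how the five reliable metrics could be computed between an explanation's predicted activations and a unit's actual activations. The function name, variable names, and the simple thresholding used to binarize activations are illustrative assumptions; a sanity check in the spirit of the missing/excessive label tests would perturb `predicted` and verify that the scores drop.

```python
# Hedged sketch: score an explanation's predicted activations against a unit's
# actual activations using the five metrics that passed both sanity checks.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score, average_precision_score

def reliable_metrics(actual, predicted, threshold=0.0):
    """Illustrative (assumed) scoring helper, not the paper's exact code.

    actual    : real-valued activations of one neuron / SAE feature, shape (n_inputs,)
    predicted : explanation-derived activation predictions, shape (n_inputs,)
    threshold : cutoff used to binarize activations for the set-based metrics
    """
    actual_bin = (actual > threshold).astype(int)
    predicted_bin = (predicted > threshold).astype(int)

    # F1-score and IoU compare the binarized activation masks.
    f1 = f1_score(actual_bin, predicted_bin)
    intersection = np.logical_and(actual_bin, predicted_bin).sum()
    union = np.logical_or(actual_bin, predicted_bin).sum()
    iou = intersection / union if union > 0 else 0.0

    # Pearson correlation and cosine similarity compare the raw values.
    pearson, _ = pearsonr(actual, predicted)
    cosine = np.dot(actual, predicted) / (
        np.linalg.norm(actual) * np.linalg.norm(predicted)
    )

    # AUPRC treats binarized actual activations as labels and the
    # predicted activations as ranking scores.
    auprc = average_precision_score(actual_bin, predicted)

    return {"F1": f1, "IoU": iou, "Pearson": pearson,
            "Cosine": cosine, "AUPRC": auprc}
```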


Datasets used to evaluate which of the 18 metrics work well:
- Vision
  - ImageNet (1,000 classes)
  - Places365 (365 place categories)
  - CUB-200-2011 (200 bird species; 112 detailed attribute labels)
- Language
  - OpenWebText (for evaluating GPT-2; limited to 500 frequently occurring tokens)