IIA

Interchange-Intervention Accuracy

The fraction of interchange interventions in which setting a high-level variable X and setting its corresponding low-level feature ΠX to their counterfactual values produce identical outputs. The metric measures how causally faithful the alignment between high-level variables and low-level features is.

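A causal metric based on counterfactual interventions. In one common notation:

$$\mathrm{IIA} = \frac{1}{|D|} \sum_{(b,\, s) \in D} \mathbb{1}\left[ \mathrm{IntInv}(H, b, s, X) = \mathrm{IntInv}(L, b, s, \Pi_X) \right]$$

where $D$ is the dataset of (base, source) input pairs, $X$ are high-level nodes, $\Pi_X$ is the set of aligned low-level nodes, $H$ and $L$ are the high- and low-level models, and $\mathrm{IntInv}$ performs the interchange intervention: it runs the model on the base input $b$ with the given nodes fixed to the values they take on the source input $s$.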

IIA is the fraction of interventions for which the high- and low-level models agree.
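
The definition can be made concrete with a small sketch. Everything in it is a toy assumption made for illustration and is not taken from MIB or any library: the task (is a + b greater than c?), the function names, and the two-slot "hidden state" standing in for a network's low-level features. The high-level variable X is a + b. Hidden slot 0 is the feature that really encodes X, and slot 1 is a deliberately wrong alignment.

```python
from itertools import product

# Toy high-level causal model: X = a + b, output = (X > c).
def high_level_x(inp):
    a, b, _ = inp
    return a + b

def high_level(inp, x_override=None):
    _, _, c = inp
    x = high_level_x(inp) if x_override is None else x_override
    return x > c

# Toy low-level model: hidden state [a + b, c], output = (h[0] > h[1]).
def low_level_hidden(inp):
    a, b, c = inp
    return [a + b, c]

def low_level(inp, patch=None):
    h = low_level_hidden(inp)
    if patch is not None:
        idx, value = patch
        h[idx] = value  # interchange intervention on one hidden slot
    return h[0] > h[1]

def iia(dataset, feature_idx):
    """Fraction of (base, source) pairs on which both models agree
    after the matching interchange interventions."""
    agree = 0
    for base, source in dataset:
        # High level: run on base with X fixed to its value on source.
        hi = high_level(base, x_override=high_level_x(source))
        # Low level: run on base with the candidate feature taken from source.
        src_value = low_level_hidden(source)[feature_idx]
        lo = low_level(base, patch=(feature_idx, src_value))
        agree += int(hi == lo)
    return agree / len(dataset)

inputs = list(product(range(3), repeat=3))
dataset = [(b, s) for b in inputs for s in inputs]

aligned = iia(dataset, feature_idx=0)     # slot 0 really encodes X
misaligned = iia(dataset, feature_idx=1)  # slot 1 encodes c, not X
print(aligned, misaligned)
```

With the correct alignment every intervention agrees, so the score is 1.0. With the wrong slot some pairs disagree and the score falls below 1.0, which is the signal IIA is meant to provide.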

MIB (Mechanistic Interpretability Benchmark)

Every dataset consists of (original, n counterfactuals) pairs. Each counterfactual is constructed so that the expected outcome is unambiguous: the output should either stay the same or change.

mib-bench (Mechanistic Interpretability Benchmark): principled evaluation of mechanistic interpretability methods.

https://huggingface.co/mib-bench

MIB (Mechanistic Interpretability Benchmark) paper on arXiv

https://arxiv.org/pdf/2504.13151
