Interchange-Intervention Accuracy (IIA)
The fraction of cases in which intervening on a high-level variable X and intervening on its corresponding low-level feature Π_X produce the same output. The metric measures how causally faithful the alignment between high-level variables and low-level features is.
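A causal metric based on counterfactual interventions:
$$\mathrm{IIA} = \frac{1}{|D|} \sum_{(o, c) \in D} \mathbb{1}\left[\mathrm{IntInv}(H, o, c, X) = \mathrm{IntInv}(L, o, c, \Pi_X)\right]$$
where
- $D$ is the dataset of (original, counterfactual) input pairs $(o, c)$,
- $X$ are high-level nodes of the high-level model $H$,
- $\Pi_X$ is the set of aligned low-level nodes of the low-level model $L$, and
- $\mathrm{IntInv}(M, o, c, V)$ performs the interchange intervention: it runs model $M$ on the original input $o$ with the nodes $V$ set to the values they take on the counterfactual input $c$.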
IIA is the fraction of interventions for which the high- and low-level models agree.
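The definition above can be illustrated with a minimal sketch on a toy problem. Everything here is an assumption made for illustration: the two toy models (`high_level`, `low_level`), the node names (`X`, `h0`, `h1`), the helper names (`interchange`, `iia`), and the three-pair dataset are hypothetical and do not come from any particular library or benchmark. The high-level model computes X = a + b and outputs X + c; the low-level model stores a + b in `h0` and c in `h1`.

```python
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple

Inputs = Tuple[int, int, int]
Nodes = Dict[str, int]
Model = Callable[[Inputs, Optional[Nodes]], Tuple[Nodes, int]]


def high_level(inputs: Inputs, fixed: Optional[Nodes] = None) -> Tuple[Nodes, int]:
    """Toy high-level causal model: X = a + b, output = X + c."""
    fixed = fixed or {}
    a, b, c = inputs
    nodes = {"X": fixed.get("X", a + b)}
    return nodes, nodes["X"] + c


def low_level(inputs: Inputs, fixed: Optional[Nodes] = None) -> Tuple[Nodes, int]:
    """Toy low-level model: h0 = a + b, h1 = c, output = h0 + h1."""
    fixed = fixed or {}
    a, b, c = inputs
    nodes = {"h0": fixed.get("h0", a + b), "h1": fixed.get("h1", c)}
    return nodes, nodes["h0"] + nodes["h1"]


def interchange(model: Model, original: Inputs, counterfactual: Inputs,
                node_names: Sequence[str]) -> int:
    """Run `model` on `original`, with `node_names` set to the values
    they take when the model is run on `counterfactual`."""
    source_nodes, _ = model(counterfactual, None)
    patch = {name: source_nodes[name] for name in node_names}
    _, output = model(original, patch)
    return output


def iia(dataset: Iterable[Tuple[Inputs, Inputs]],
        high_nodes: Sequence[str], low_nodes: Sequence[str]) -> float:
    """Fraction of (original, counterfactual) pairs on which the
    intervened high- and low-level models produce the same output."""
    pairs = list(dataset)
    agree = sum(
        interchange(high_level, o, c, high_nodes)
        == interchange(low_level, o, c, low_nodes)
        for o, c in pairs
    )
    return agree / len(pairs)


# Hypothetical (original, counterfactual) input pairs.
dataset = [((1, 2, 3), (4, 5, 6)), ((0, 1, 0), (2, 2, 7)), ((3, 3, 3), (1, 0, 9))]

aligned = iia(dataset, ["X"], ["h0"])     # X aligned with the node that stores a + b
misaligned = iia(dataset, ["X"], ["h1"])  # X aligned with the wrong node
print(aligned, misaligned)  # prints "1.0 0.0"
```

With the correct alignment (X to `h0`) the two models agree on every pair, so IIA is 1.0. With the wrong alignment (X to `h1`) they disagree on every pair in this small dataset, so IIA is 0.0. In a real setting the low-level nodes would be features of a neural network's activations rather than named dictionary entries.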
MIB (Mechanistic Interpretability Benchmark)
Every dataset consists of examples that pair an original input with n counterfactual inputs. The pairs are constructed so that it is unambiguous whether the output should stay the same or should change after the intervention.