IIA

Creator
Seonglae Cho
Created
2025 Jun 13 22:24
Edited
2025 Jun 14 1:3

Interchange-Intervention Accuracy

The fraction of cases in which intervening on a high-level variable $X$ and intervening on its aligned low-level feature $\Pi_X$ produce identical outputs. The metric measures how causally accurate the alignment between variables and features is.
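A causal metric based on counterfactual interventions:
$$
\mathrm{IIA} = \frac{1}{|\mathcal{D}|} \sum_{(b, s) \in \mathcal{D}} \mathbb{1}\left[ \mathrm{II}(\mathcal{H}, b, s, X) = \mathrm{II}(\mathcal{N}, b, s, \Pi_X) \right]
$$
where
  • $\mathcal{D}$ is the dataset of pairs of an original input $b$ and a counterfactual input $s$,
  • $X$ are high-level nodes (in the high-level model $\mathcal{H}$),
  • $\Pi_X$ is the set of aligned low-level nodes (in the low-level model $\mathcal{N}$), and
  • $\mathrm{II}$ performs the interchange intervention: it runs a model on $b$ with the given nodes fixed to the values they take on $s$.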
IIA is the fraction of interventions for which the high- and low-level models agree.
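The definition above can be sketched in Python on a toy problem. Everything in this example is an assumption made for illustration: the high-level model (`S = a + b`, output `S * c`), the two-unit "hidden layer" standing in for a low-level network, and the helper names `run_high`, `run_low`, and `iia`. It is a minimal sketch of the metric, not the implementation used by any particular library or benchmark.

```python
from typing import Dict, List, Optional, Tuple

Example = Tuple[int, int, int]  # toy input (a, b, c)


def run_high(x: Example, override: Optional[Dict[str, int]] = None):
    """Hypothetical high-level causal model: S = a + b, output = S * c."""
    a, b, c = x
    state = {"S": a + b}
    state.update(override or {})  # interchange: overwrite chosen variables
    return state["S"] * c, state


def run_low(x: Example, override: Optional[Dict[int, int]] = None):
    """Hypothetical low-level model with a 2-unit hidden layer."""
    a, b, c = x
    hidden = {0: a + b, 1: c}
    hidden.update(override or {})  # interchange: overwrite chosen units
    return hidden[0] * hidden[1], hidden


def iia(dataset: List[Tuple[Example, Example]],
        high_nodes: List[str],
        low_nodes: List[int]) -> float:
    """Fraction of (original, counterfactual) pairs on which both models
    agree after the aligned nodes are swapped in from the counterfactual."""
    agree = 0
    for base, source in dataset:
        _, high_src = run_high(source)
        _, low_src = run_low(source)
        high_out, _ = run_high(base, {k: high_src[k] for k in high_nodes})
        low_out, _ = run_low(base, {k: low_src[k] for k in low_nodes})
        agree += int(high_out == low_out)
    return agree / len(dataset)


dataset = [((1, 2, 3), (4, 5, 6)),
           ((0, 1, 2), (3, 3, 3)),
           ((2, 2, 2), (1, 0, 5))]

good = iia(dataset, ["S"], [0])  # S aligned with the unit that stores a + b
bad = iia(dataset, ["S"], [1])   # S aligned with the unit that stores c
print(good, bad)
```

With the correct alignment (hidden unit 0) the two models agree on every pair; aligning `S` with the unit that stores `c` makes them disagree on every pair in this toy dataset, which is the kind of difference IIA is meant to expose.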

MIB (Mechanistic Interpretability Benchmark)

Every dataset consists of (original, n counterfactuals) pairs, which make explicit the cases where the output should stay the same and the cases where it should differ.
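How such a pair yields "same" or "different" expectations can be illustrated with a small sketch. The `CounterfactualSet` container, the toy high-level model, and `expected_labels` are hypothetical names chosen for this example and are not MIB's actual data format: for each counterfactual, the intermediate variable is taken from the counterfactual, patched into the original, and the result is compared with the unpatched output.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Example = Tuple[int, int, int]  # toy input (a, b, c)


@dataclass
class CounterfactualSet:
    """Hypothetical container: one original input and n counterfactuals."""
    original: Example
    counterfactuals: List[Example]


def high_level(x: Example, s_override: Optional[int] = None) -> int:
    """Toy high-level model: S = a + b, output = S * c."""
    a, b, c = x
    s = a + b if s_override is None else s_override
    return s * c


def expected_labels(example: CounterfactualSet) -> List[str]:
    """Label each counterfactual by whether patching its S into the
    original should change the output."""
    base_out = high_level(example.original)
    labels = []
    for cf in example.counterfactuals:
        s_from_cf = cf[0] + cf[1]  # value of S on the counterfactual
        patched = high_level(example.original, s_override=s_from_cf)
        labels.append("same" if patched == base_out else "different")
    return labels


example = CounterfactualSet(original=(1, 2, 3),
                            counterfactuals=[(2, 1, 9), (4, 5, 6)])
labels = expected_labels(example)
print(labels)
```

The first counterfactual has the same value of `S` as the original, so the output should not change; the second carries a different `S`, so it should.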
