The concept of distinguishability reminds me of the Turing test.
Circuit Sufficient Condition
This means the circuit completely reproduces the behavior of the original model h_θ. Can an intervention be viewed as an addition to the computational graph, or does a hard intervention delete (overwrite) the previous value?
Abstraction
Abstraction is a function that summarizes a complex causal model K1 into a simpler model K2. (observational consistency / interventional consistency)
Observational consistency: when the variables computed in the large circuit are collected and the mapping π is applied, the result should exactly match the variable values of the small circuit.
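A minimal sketch of what checking observational consistency could look like; the circuits, variables, and the mapping π (`pi`) below are toy placeholders, not from the paper:

```python
import numpy as np

# Toy "large circuit": returns its hidden variable values for an input.
def run_large_circuit(x):
    return {"v1": x + 1.0, "v2": 2.0 * x, "v3": x ** 2}

# Toy "small circuit": a coarser model with a single hidden variable.
def run_small_circuit(x):
    return {"u1": 3.0 * x + 1.0}

# The mapping pi collects large-circuit variables and summarizes them
# into small-circuit variables (here u1 = v1 + v2).
def pi(large_values):
    return {"u1": large_values["v1"] + large_values["v2"]}

# Observational consistency: pi applied to the large circuit's values
# should exactly match the small circuit's values on ordinary inputs.
for x in np.linspace(-2.0, 2.0, 9):
    assert np.isclose(pi(run_large_circuit(x))["u1"], run_small_circuit(x)["u1"])
```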
Intervention
Intervention consistency: When an intervention is made in the large circuit, its effect should appear as an intervention with the same meaning in the small circuit.
Fixing a specific node makes it independent of its parents; hard intervention is equivalent to activation patching, and this paper only deals with hard interventions.
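A minimal sketch of a hard intervention implemented as activation patching with a PyTorch forward hook, on a toy stand-in for h_θ; the module, shapes, and the counterfactual source are assumptions for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer model standing in for h_theta.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x_clean = torch.randn(1, 4)
x_counterfactual = torch.randn(1, 4)

# 1. Cache the first layer's activation on the counterfactual input.
cached = {}
def cache_hook(module, inputs, output):
    cached["act"] = output.detach()

handle = model[0].register_forward_hook(cache_hook)
model(x_counterfactual)
handle.remove()

# 2. Hard intervention: overwrite the node's value on the clean run, cutting its
#    dependence on its parents -- this is exactly activation patching.
def patch_hook(module, inputs, output):
    return cached["act"]

handle = model[0].register_forward_hook(patch_hook)
patched_output = model(x_clean)
handle.remove()
print(patched_output)
```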
Representation
Representations are causal models that transform circuit K into a simpler linear-chain form.
Representation = a common format (canonical form) created for alignment/comparison between two models.
Signal condition - the original circuit's hidden activation v_k must be mapped faithfully into the representation space h_k = the representation must contain the original information without loss.
Noise condition - the representation h_k must allow the original hidden activation v_k to be recovered = the original value must be restorable from the representation.
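One rough way to read the two conditions, assuming an SAE-like encoder/decoder pair between activations v_k and representations h_k (shapes and the pseudo-inverse decoder are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, d_rep, n = 16, 64, 1000

W_enc = rng.normal(size=(d_act, d_rep)) / np.sqrt(d_act)  # activation -> representation
W_dec = np.linalg.pinv(W_enc)                             # representation -> activation

v = rng.normal(size=(n, d_act))   # hidden activations v_k
h = v @ W_enc                     # signal condition: v_k is carried into representation space h_k
v_rec = h @ W_dec                 # noise condition: h_k can be mapped back to v_k

rel_error = np.linalg.norm(v - v_rec) / np.linalg.norm(v)
print(f"relative reconstruction error: {rel_error:.2e}")  # ~0 means no information was lost
```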
Interpretation
Trivial Causal Model
A trivial causal model is the simplest, coarsest, most self-evident form of causal model, mentioned to show that "a circuit always exists." It does not consider any complex causal relationships, but simply mimics the input-output relationship of the original model.
Using the minimal fact that "we can think of the original model as just a function," we can always create at least one causal model → this is the trivial circuit. The trivial circuit is a baseline that guarantees existence. In the case of a trivial causal model, compression is infinite.
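A minimal illustration of the trivial circuit: wrap the original model as a single input → output function with no intermediate structure (`h_theta` below is just a placeholder):

```python
# The original model, viewed only as a function (placeholder for h_theta).
def h_theta(x):
    return 2 * x + 1

# Trivial causal model: a single node mapping input directly to output.
# It mimics the input-output behavior exactly, so at least one circuit always
# exists, but it exposes no internal causal structure at all.
trivial_circuit = {"output": h_theta}

for x in range(5):
    assert trivial_circuit["output"](x) == h_theta(x)
```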
Interpretation Fidelity
A comparison between the model and the interpretation: the interpretation's output should be close to the model's output.
In other words, activations are the actual intermediate values of the model (like MLP or attention output activations), and representations are mappings of those activations into another space (like SAE dictionary latent features).
Interpretation A must be an abstraction of circuit K, and the output error must be small (η-faithful).
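A sketch of the η-faithfulness check under toy assumptions: the interpretation's output should stay within η of the model's output over the inputs of interest (the models, η, and the error metric are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def model_output(x):           # stand-in for the original model's output
    return np.tanh(x)

def interpretation_output(x):  # stand-in for the abstracted interpretation's output
    return x - x ** 3 / 3      # Taylor approximation of tanh, close for small x

eta = 0.05
xs = rng.uniform(-0.5, 0.5, size=1000)
max_error = np.abs(model_output(xs) - interpretation_output(xs)).max()
print("eta-faithful:", bool(max_error <= eta))
```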
The implementation set is a set of circuits: the set of all possible circuits that satisfy a single interpretation A.
Representation Distance
Linear-transformability distance - described in a summed or normed form, with Lipschitz and operator-norm conditions. It is a metric between two representations that aligns each level's hidden representation with a single global linear map A and then minimizes the remaining difference.
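A minimal sketch of that alignment, assuming a least-squares fit of one global linear map A over all levels and a Frobenius-norm residual (both are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_levels, n_samples = 8, 5, 32

# Per-level hidden representations of two models (rows = samples).
H1 = [rng.normal(size=(n_samples, d)) for _ in range(n_levels)]
true_map = rng.normal(size=(d, d))
H2 = [h @ true_map + 0.01 * rng.normal(size=(n_samples, d)) for h in H1]

# Fit one global linear map A across all levels: H1 @ A ~= H2.
X, Y = np.vstack(H1), np.vstack(H2)
A, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Representation distance: residual left after the best global alignment,
# summed over levels (the exact norm/normalization is a modeling choice).
distance = sum(np.linalg.norm(h1 @ A - h2) for h1, h2 in zip(H1, H2))
print(f"aligned representation distance: {distance:.3f}")
```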
Implementation
The distance between implementations is defined through the distance between the corresponding circuits; in practice it is approximated by the representation distance.
Compression and Ambiguity Score
Compression is the diameter of the implementation set, measuring how spread out the implementations are; it is a metric for a single interpretation. When comparing two interpretations in actual experiments, circuit distance cannot be measured, so representation distance is used as a proxy; compression is therefore computed with representation distance.
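A sketch of compression as the diameter of an interpretation's implementation set, using a placeholder representation distance as the proxy:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Representations of several sampled implementations of one interpretation A.
implementations = [rng.normal(size=(32, 8)) + 0.1 * i for i in range(4)]

def representation_distance(h1, h2):
    # Placeholder proxy; in practice this would be the aligned distance above.
    return float(np.linalg.norm(h1 - h2))

pairwise = [representation_distance(a, b) for a, b in combinations(implementations, 2)]
compression = max(pairwise)  # diameter: how spread out the implementation set is
print(f"compression (diameter): {compression:.3f}")
```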
Ambiguity Score
Ambiguity score = "how far apart from each other are the circuit implementations that satisfy interpretation A?", measured as an average representation distance. It can also be read as the probability that an implementation of B is not "closer" than one of A.
Interpretive Equivalence says two interpretations are equivalent if the Hausdorff distance between their implementation sets is small.
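A sketch of interpretive equivalence as a small Hausdorff distance between the two interpretations' (sampled) implementation sets; the distance proxy and the sets are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def representation_distance(h1, h2):
    return float(np.linalg.norm(h1 - h2))

# Sampled implementations (in a shared representation space) of interpretations A and B.
impls_A = [rng.normal(size=(32, 8)) for _ in range(4)]
impls_B = [h + 0.05 * rng.normal(size=h.shape) for h in impls_A]

def hausdorff(set_a, set_b):
    # For each element, the distance to the nearest element of the other set,
    # then the worst case over both directions: small means the sets cover each other.
    d_ab = max(min(representation_distance(a, b) for b in set_b) for a in set_a)
    d_ba = max(min(representation_distance(a, b) for a in set_a) for b in set_b)
    return max(d_ab, d_ba)

print("Hausdorff distance:", hausdorff(impls_A, impls_B))
```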
GetImpl
"A function that generates another implementation of the current model"
- Find the functionally important attention head set P
- Based on edge-attribution / mediation analysis
- Heads whose ablation degrades performance are in P
- The rest are in N
- Generate two types of interventions
- Rotation: rotate P heads' activations within the same subspace → interpretation is maintained (basis change)
- Deletion/Patching: replace some of N with counterfactual activations → randomly change only unimportant structures → generate a "different model" in the implementation space
- The model created this way is a new implementation that maintains the same interpretation.
By repeating this process, we sample the model's "implementation set".
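A rough sketch of one GetImpl step on cached head activations: rotate the P heads within their own subspace and patch some N heads with counterfactual activations; head indices, shapes, and the 0.5 patching probability are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, n_tokens = 12, 16, 10

# Cached per-head activations of the current model and of a counterfactual run.
acts = {h: rng.normal(size=(n_tokens, d_head)) for h in range(n_heads)}
counterfactual_acts = {h: rng.normal(size=(n_tokens, d_head)) for h in range(n_heads)}

P = {0, 3, 7}                    # functionally important heads (from ablation / attribution)
N = set(range(n_heads)) - P      # the rest

def random_rotation(d):
    # Orthogonal matrix via QR: a basis change that stays within the same subspace.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def get_impl(acts):
    new_acts = {}
    for h, a in acts.items():
        if h in P:
            new_acts[h] = a @ random_rotation(d_head)      # rotation: interpretation preserved
        elif rng.random() < 0.5:
            new_acts[h] = counterfactual_acts[h]           # patching: unimportant head replaced
        else:
            new_acts[h] = a
    return new_acts

# Repeating the sampling yields a set of implementations sharing the same interpretation.
implementation_set = [get_impl(acts) for _ in range(5)]
```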
Experiments
1. n-Permutation (RASP-based toy transformer)
2. Indirect object identification (GPT-2 small/medium, Pythia 160M–2.8B)
3. POS tagging vs next-token prediction
Main Result 1
Interpretive equivalence ⇒ All representation distances are small (upper bound)
Main Result 2
Representation distances are sufficiently small ⇒ Interpretive equivalence (lower bound)
Main Result 3
"Upper bound on the probability that two models have the same interpretation"
Deterministic Causal Model
A (deterministic) causal model with K components is a quadruple (x, {v_k}, {f_k}, ≺), where {v_1, …, v_K} is a set of hidden variables such that each v_k is determined by a function f_k of its parents, x is an input variable, and ≺ defines a partial ordering over the hidden variables.
Hidden variables are intermediate computation results or activations that indicate the model state and represent nodes in the computational graph. Functions are considered as edges with a partial order that connects them to form a DAG computational graph.
For each k, v_k = f_k(pa(v_k)), where f_k is a deterministic function and pa(v_k) ⊆ {x, v_1, …, v_{k-1}} are the parents of v_k.
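A minimal sketch of evaluating such a model: walk the hidden variables in an order compatible with ≺ and compute each v_k from its parents (the toy graph and mechanisms are placeholders):

```python
# Toy deterministic causal model: parents and mechanisms for each hidden variable.
parents = {"v1": ["x"], "v2": ["x", "v1"], "v3": ["v1", "v2"]}
mechanisms = {
    "v1": lambda x: x + 1,
    "v2": lambda x, v1: x * v1,
    "v3": lambda v1, v2: v1 + v2,
}

def evaluate(x, order=("v1", "v2", "v3")):
    # `order` must respect the partial ordering: every parent comes before its child.
    values = {"x": x}
    for k in order:
        values[k] = mechanisms[k](*(values[p] for p in parents[k]))
    return values

print(evaluate(2))  # {'x': 2, 'v1': 3, 'v2': 6, 'v3': 9}
```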

Seonglae Cho