Circuit Stability

The degree to which a language model uses consistent computational circuits across diverse inputs for a task

Circuit Stability Characterizes Language Model Generalization

The author uses attention patching for circuit extraction, but unlike previous approaches that use binary hard circuits, they introduce Soft Circuits which make the analysis more tractable and enable numerical analysis. Circuit Importance is measured as the change in KL Loss with circuits that are represented as a graph G = (V, E) (Elhage-style). While basically every edge is connected in the full graph, they prune based on a threshold and observe significant changes in behavior at specific range.

He then creates an adjacency matrix by binarizing the circuit similarity (Spearman ρ) between subtasks. Five circuit families including equal-digit, single-digit, one-digit-diff, etc., naturally separate in the matrix clustering results. Consequently, each group represents a type of problem that reuses a consistent circuit structure.

arxiv.org

https://arxiv.org/pdf/2505.24731

Circuit Stability

The degree to which a language model uses consistent computational circuits across diverse inputs for a task

Circuit Stability Characterizes Language Model Generalization

Recommendations