Fundamental Interpretability, Mech-interp
Attempting to reverse engineer neural networks into a human-interpretable form.
Why it is important
The fundamental difference between human and machine intelligence lies in whether the structure arises from evolution or from intentional design. In the long term this difference will only grow in significance: the human brain is a black box both ethically and physically, which makes true understanding difficult, whereas artificial intelligence gives us the freedom to access and modify its reasoning process directly. This relates to AI control and safety, but above all, this often-overlooked freedom will become increasingly important as artificial intelligence advances.
For example, hallucinations in robotics models pose significant physical dangers, unlike language-model hallucinations, which merely produce incorrect information. Mechanistic interpretability offers a promising, explicit way to control AI.
Pros
- Investing in interpretability-friendly model architecture now may save a lot of interpretability effort later.
- Any group that owns an LLM will want to understand its inner workings to build trust with clients.
Challenges
One of the core challenges of mechanistic interpretability is making neural network parameters meaningful by contextualizing them, i.e. relating raw weights and activations to the computations they implement.
Mechanistic interpretability Theory
Mechanistic interpretability Types
Mechanistic interpretability Usages
A Mathematical Framework for Transformer Circuits
Transformer language models are an emerging technology that is gaining increasingly broad real-world use, for example in systems like GPT-3, LaMDA, Codex, Meena, Gopher, and similar models. However, as these models scale, their open-endedness and high capacity creates an increasing scope for unexpected and sometimes harmful behaviors. Even years after a large model is trained, both creators and users routinely discover model capabilities – including problematic behaviors – they were previously unaware of.
https://transformer-circuits.pub/2021/framework/index.html
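As a pointer to the paper's core result, the one-layer, attention-only case can be compressed into a single path expansion: a direct "bigram" path plus one term per attention head. Notation follows the linked paper; masking, layer norm, and biases are omitted, so treat this as a sketch rather than the full derivation.

```latex
% End-to-end logits of a one-layer attention-only transformer as a sum of paths
T \;=\; \mathrm{Id} \otimes W_U W_E \;+\; \sum_{h \in \text{heads}} A^{h} \otimes \big(W_U W^{h}_{OV} W_E\big)
% A^h: the head's attention pattern, computed from the QK circuit W_E^\top W^{h}_{QK} W_E
% OV and QK circuits: W^{h}_{OV} = W^{h}_{O} W^{h}_{V}, \qquad W^{h}_{QK} = W^{h\top}_{Q} W^{h}_{K}
```

The point of the decomposition is that each head's contribution to the logits factors into a "where to attend" part (QK circuit) and a "what to move" part (OV circuit), which can be inspected separately.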
Open Problems
200 Concrete Open Problems in Mechanistic Interpretability: Introduction — AI Alignment Forum
EDIT 19/7/24: This sequence is now two years old, and fairly out of date. I hope it's still useful for historical reasons, but I no longer recommend…
https://www.alignmentforum.org/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team — LessWrong
Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we’d work on…
https://www.lesswrong.com/posts/KfkpgXdgRheSRWDy8/a-list-of-45-mech-interp-project-ideas-from-apollo-research

Open problems with activation engineering
Circuits Updates - April 2024
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
https://transformer-circuits.pub/2024/april-update/index.html#attr-dl
AI Safety
Mechanistic Interpretability for AI Safety -- A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse engineering the computational mechanisms...
https://arxiv.org/abs/2404.14082

Criticism
Because artificial intelligence systems behave as complex systems (an argument drawn from chaos theory), it is arguably more efficient and useful to observe macroscopic, statistical patterns (higher-level characteristics). Attempts to "completely analyze" AI models at the neuron/circuit level have yielded few practical results, precisely because of this complex-systems nature. Hence the argument to cut back excessive resource investment in mechanistic interpretability and to focus instead on higher-level interpretation and control research that can produce tangible results.
The Misguided Quest for Mechanistic AI Interpretability | AI Frontiers
Dan Hendrycks, May 15, 2025 — Despite years of effort, mechanistic interpretability has failed to provide insight into AI behavior — the result of a flawed foundational assumption.
https://www.ai-frontiers.org/articles/the-misguided-quest-for-mechanistic-ai-interpretability

Pragmatic Interpretability
The traditional "complete reverse engineering" approach has made very slow progress. Rather than reverse engineering the entire network, the field is shifting toward pragmatic interpretability that directly tackles real-world safety problems.
Without feedback loops, self-deception becomes easy → proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like reconstruction error turned out to be nearly meaningless; testing performance on proxies such as OOD generalization, unlearning, and hidden-goal extraction exposed the real limitations clearly.
This is where the criticism of SAEs reappears: the means often become the end. It is easy to stop at "we saw something with an SAE." Be wary of reaching for SAEs when simpler methods would do. Does this actually help us understand the model better, or did we just extract a lot of features?
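To make the reconstruction-error point concrete, here is a minimal sketch of a sparse autoencoder forward pass and its training loss (PyTorch; the dimensions and penalty weight are hypothetical). A low value of this loss is exactly the kind of metric that can look good while saying little about performance on the proxy tasks above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary with ReLU latents."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = F.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)         # reconstructed activations
        return recon, features

# Hypothetical activations from some transformer layer (batch 256, width 512).
acts = torch.randn(256, 512)
sae = SparseAutoencoder(d_model=512, d_dict=4096)
recon, features = sae(acts)

# The criticized metric: reconstruction error (plus an L1 sparsity penalty).
# It can look fine even when the features are useless for unlearning or OOD probes.
recon_loss = F.mse_loss(recon, acts)
sparsity = features.abs().mean()
loss = recon_loss + 1e-3 * sparsity
print(f"reconstruction MSE: {recon_loss.item():.4f}, L1 sparsity: {sparsity.item():.4f}")
```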
A Pragmatic Vision for Interpretability — AI Alignment Forum
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engi…
https://www.alignmentforum.org/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability

Dead Salmon
A famous example of a false positive: researchers put a dead salmon in an fMRI scanner, showed it pictures of people, and found brain regions that were "statistically significant" for responding to social emotions. Of course, the salmon wasn't thinking. The issue was that testing tens of thousands of voxels simultaneously without proper multiple-comparisons correction (FDR, Bonferroni, etc.) let random noise appear "significant." This kind of spurious finding is now remembered as the "dead salmon" artifact.
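A quick numerical illustration of the same failure mode, assuming pure-noise data and arbitrary voxel/scan counts (not the original study's setup): thousands of uncorrected tests on noise still produce hundreds of "significant" voxels, while a Bonferroni correction removes them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure noise standing in for fMRI voxel responses: no voxel carries any real signal.
n_voxels, n_scans = 10_000, 20
noise = rng.normal(size=(n_voxels, n_scans))

# One-sample t-test per voxel against zero: tens of thousands of simultaneous tests.
t_vals, p_vals = stats.ttest_1samp(noise, popmean=0.0, axis=1)

alpha = 0.05
uncorrected = np.sum(p_vals < alpha)            # roughly 5% of voxels look "significant"
bonferroni = np.sum(p_vals < alpha / n_voxels)  # correction wipes out the false positives

print(f"'significant' voxels without correction: {uncorrected}")
print(f"'significant' voxels with Bonferroni:     {bonferroni}")
```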

Seonglae Cho