Fundamental Interpretability (Mechanistic Interpretability, "Mech-interp")
The attempt to reverse engineer a neural network's computations into a human-interpretable form.
Why it is important
A fundamental difference between human and artificial intelligence lies in their origin: one is shaped by evolution, the other by intentional design. In the long term, this difference will only grow in significance. The human brain is a black box both ethically and physically, making a true understanding of it difficult, whereas artificial intelligence offers us a freedom the brain does not: we can directly access and modify its reasoning process. This freedom bears on AI control and safety, but above all it has been overlooked, and it will become increasingly important as artificial intelligence advances.
For example, hallucinations in robotics models pose physical dangers, unlike hallucinations in language models, which merely produce incorrect information. Mechanistic interpretability offers a promising, explicit method for controlling AI.
Pros
- Investing in interpretability-friendly model architectures now may save a great deal of interpretability effort in the future.
- Any group deploying an LLM will want to understand its inner workings to build trust with clients.
Challenges
One of the core challenges of mechanistic interpretability is making neural network parameters meaningful by contextualizing them: an individual weight has no intrinsic meaning on its own, and only becomes interpretable relative to the surrounding computation and the basis in which it is expressed.
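A minimal sketch (my own illustration, not from the text) of why raw parameters need context: in a purely linear two-layer network, rotating the hidden basis changes every individual weight value yet leaves the computed function identical, so the numbers themselves carry no standalone meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden(4) x input(3)
W2 = rng.normal(size=(2, 4))   # output(2) x hidden(4)

# Random orthogonal rotation of the hidden space.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
W1_rot = Q @ W1        # rotated first-layer weights
W2_rot = W2 @ Q.T      # compensating rotation of the second layer

x = rng.normal(size=(3,))
y_original = W2 @ (W1 @ x)
y_rotated = W2_rot @ (W1_rot @ x)

# Same function, completely different parameter values.
assert np.allclose(y_original, y_rotated)
assert not np.allclose(W1, W1_rot)
```

With a nonlinearity between the layers this symmetry breaks, which is one reason interpretability work focuses on activations and circuits in context rather than on weights in isolation.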
Mechanistic Interpretability: Theory
Mechanistic Interpretability: Types
Mechanistic Interpretability: Usages