Fundamental Interpretability (Mech-interp)
Attempting to reverse engineer neural networks into a human-interpretable form.
Pros
- Investing in interpretable model architectures now may save a lot of interpretability effort in the future.
- Any group deploying an LLM will want to understand its inner workings to build trust with clients.
Challenges
One of the core challenges of mechanistic interpretability is to make neural network parameters meaningful by contextualizing them: a weight only becomes interpretable relative to the input and output bases it connects (see the sketch below).
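A minimal sketch of what "contextualizing" can mean, using a hypothetical two-layer linear model in NumPy: the raw hidden-layer weights are basis-dependent and unreadable on their own, but composing them into a single effective input-to-output map lets each entry be read against interpretable input features and output classes. All feature and class names here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "learned" weights of a two-layer linear model:
#   hidden = W_in @ x ;  logits = W_out @ hidden
d_in, d_hidden, d_out = 4, 8, 3
W_in = rng.normal(size=(d_hidden, d_in))
W_out = rng.normal(size=(d_out, d_hidden))

# A single entry of W_in is meaningless in isolation: the hidden basis
# is arbitrary (any invertible R yields the same function via
# W_out @ R^-1 and R @ W_in).
# Contextualizing: compose the weights into the effective map between
# the interpretable input basis and the interpretable output basis.
W_eff = W_out @ W_in  # shape (d_out, d_in)

input_features = ["age", "income", "tenure", "clicks"]  # hypothetical labels
output_classes = ["low", "mid", "high"]

for j, cls in enumerate(output_classes):
    top = np.argsort(-np.abs(W_eff[j]))[:2]
    drivers = ", ".join(f"{input_features[i]} ({W_eff[j, i]:+.2f})" for i in top)
    print(f"class '{cls}' is driven most by: {drivers}")
```

In a real transformer the same move shows up as composing weight matrices along a path (e.g. reading an internal matrix through the embedding and unembedding), rather than staring at raw parameter values.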
Mechanistic interpretability
- Chris Olah
- Neel Nanda
The field of study of reverse engineering neural networks from the learned weights down to human-interpretable algorithms. Analogous to reverse engineering a compiled program binary back to source code.
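To make the "compiled binary back to source code" analogy concrete, here is a toy sketch with hand-constructed weights standing in for learned ones: inspecting a two-neuron ReLU network's weights recovers the human-readable algorithm f(x) = |x|.

```python
import numpy as np

# A tiny ReLU network whose weights we "reverse engineer".
# (Hand-constructed here for brevity; stands in for learned weights.)
W1 = np.array([[1.0], [-1.0]])  # hidden pre-activations: [x, -x]
w2 = np.array([1.0, 1.0])       # output: relu(x) + relu(-x)

def net(x):
    return w2 @ np.maximum(W1 @ np.array([x]), 0.0)

# Reading the weights: hidden unit 0 fires on positive x, unit 1 on
# negative x, and both feed the output with weight +1, so the network
# implements the human-interpretable algorithm f(x) = |x|.
for x in (-3.0, 0.5, 2.0):
    print(x, net(x))  # matches abs(x)
```

Real mech-interp work does this at scale: from learned weights and activations back to circuits and algorithms, rather than from a two-neuron example.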