Mechanistic interpretability

Creator: Seonglae Cho
Created: 2024 Apr 17 13:50
Edited: 2024 Nov 25 22:16

Fundamental Interpretability, Mech-interp

An attempt to reverse engineer a neural network into a human-interpretable form.

Pros

  • Investing in model architecture now may save a lot of interpretability effort in the future.
  • Any group owning an LLM will want to understand its inner workings to increase trust with clients.

Challenges

One of the core challenges of mechanistic interpretability is to make neural network parameters meaningful by contextualizing them.
Mechanistic interpretability Notion
 
 
 

Chris Olah

Neel Nanda

The field of study concerned with reverse engineering neural networks, from their learned weights down to human-interpretable algorithms. It is analogous to reverse engineering a compiled program binary back into source code.
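As a rough illustration of what "from weights to algorithm" can mean, the sketch below takes a tiny ReLU network and checks a human-readable hypothesis against it. The weights are hand-picked for this example (an assumption, not a trained model or anything from the sources above), and the names `hidden`, `forward`, and `xor_hypothesis` are hypothetical. Real networks are far larger and rarely decompose this cleanly.

```python
# Toy sketch: a 2-2-1 ReLU network with hand-picked (hypothetical) weights.
# Goal: read a human-interpretable algorithm off the parameters.

W1 = [[1.0, 1.0],   # hidden unit 0: sums the two inputs
      [1.0, 1.0]]   # hidden unit 1: sums the two inputs, then is shifted by its bias
b1 = [0.0, -1.0]
W2 = [1.0, -2.0]    # output = h0 - 2 * h1
b2 = 0.0


def relu(z):
    return max(0.0, z)


def hidden(x):
    """Hidden-layer activations for a 2-element input."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]


def forward(x):
    """Full network output."""
    h = hidden(x)
    return sum(w * hi for w, hi in zip(W2, h)) + b2


def xor_hypothesis(x):
    """Interpretation under test: unit 0 counts active inputs, unit 1 fires
    only when both are active, and the output is count - 2 * both."""
    count = x[0] + x[1]
    both = 1.0 if count == 2 else 0.0
    return count - 2 * both


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
activations = {x: hidden(x) for x in inputs}   # what each hidden unit does
outputs = {x: forward(x) for x in inputs}      # what the network computes
matches = all(forward(x) == xor_hypothesis(x) for x in inputs)

for x in inputs:
    print(x, activations[x], outputs[x])
print("hypothesis matches network:", matches)
```

Inspecting `activations` is the "contextualizing" step: each hidden unit is given a meaning, and the hypothesis is accepted only if it reproduces the network's output on every input.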

AI Safety

Reading list

Overlook

Research

 
 

Recommendations