Cross-Model Patching
Training for Mapping
Explaining smaller models using larger models
If the two models' representation spaces differ, a mapping function such as a linear transformation matrix is learned from training data to reduce that difference; a sketch of such a mapping follows.
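A minimal sketch of learning that mapping, assuming paired activations from the two models have already been collected on the same prompts; the hidden sizes and the closed-form least-squares fit are illustrative assumptions, not a fixed recipe.

```python
import torch

# Hypothetical dimensions; H_src / H_tgt stand in for paired activations
# collected by running both models on the same training prompts.
n_examples, d_src, d_tgt = 10_000, 768, 4096
H_src = torch.randn(n_examples, d_src)          # e.g., small-model activations
H_tgt = torch.randn(n_examples, d_tgt)          # e.g., large-model activations

# W minimizes ||H_src @ W - H_tgt||^2, i.e., a linear map between the spaces.
W = torch.linalg.lstsq(H_src, H_tgt).solution   # shape (d_src, d_tgt)

def map_activation(h_src: torch.Tensor) -> torch.Tensor:
    """Project a small-model activation into the large model's space
    before patching it into the larger model."""
    return h_src @ W
```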
Expressive Decoding
A technique that, unlike SAEs, requires no training; it explains activation vectors in natural language instead of relying on only a few selected values the way LLM-based explainers do. It works with two prompts (a sketch follows the list):
- source prompt: the prompt from which the activation to be explained is extracted
- inspection prompt: the prompt into which that activation is patched so the model verbalizes it
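A minimal sketch of the source-prompt / inspection-prompt mechanic using Hugging Face transformers; the GPT-2 model choice, layer indices, few-shot inspection prompt, and placeholder position are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # illustrative choice only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# 1) Source prompt: grab the residual-stream activation of the token we want
#    to explain (here the last token) at an assumed layer.
source_prompt = "The Eiffel Tower is located in the city of Paris"
src_layer = 6                            # assumed layer
with torch.no_grad():
    out = model(**tok(source_prompt, return_tensors="pt"),
                output_hidden_states=True)
h = out.hidden_states[src_layer + 1][0, -1]   # hidden state after block src_layer

# 2) Inspection prompt: a few-shot "describe x" template ending in a
#    placeholder token whose activation we overwrite with h.
inspection_prompt = ("Syria: country in the Middle East. "
                     "Leonardo DiCaprio: American actor. x")
ins = tok(inspection_prompt, return_tensors="pt")
patch_pos = ins.input_ids.shape[1] - 1        # position of the placeholder "x"

def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > patch_pos:           # patch only on the prompt pass
        hidden[0, patch_pos] = h
    return output

target_layer = 6                              # assumed layer to patch into
handle = model.transformer.h[target_layer].register_forward_hook(patch_hook)  # GPT-2-style layout
gen = model.generate(**ins, max_new_tokens=10, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(gen[0][ins.input_ids.shape[1]:]))   # model's description of h
```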
Multi-hop patching
A fix for cases where each individual reasoning step is correct but the connection between them fails: the model's intermediate representation is extracted from a specific layer and patched into another layer so that the model can derive the correct answer.
For example, for the question "What is the largest city in the country where sushi originated?", the model must first recognize that sushi originated in Japan and then determine that Tokyo is Japan's largest city.
By extracting the intermediate hidden representation and injecting it into an appropriate layer, we can help the model continue its reasoning correctly (a minimal sketch follows).
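A minimal sketch of multi-hop patching on the sushi question; the GPT-2-style layer layout (model.transformer.h), the layer indices, and the placeholder token position are assumptions that would need tuning for a real model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Hop 1: let the model resolve the bridge entity ("the country where sushi
# originated") and keep the last-token hidden state at a late layer.
hop1 = "The country where sushi originated is"
with torch.no_grad():
    out = model(**tok(hop1, return_tensors="pt"), output_hidden_states=True)
late_layer = 10
bridge = out.hidden_states[late_layer + 1][0, -1]   # hopefully encodes "Japan"

# Hop 2: inject that representation early in a prompt for the second hop,
# at the position of the placeholder token " x".
hop2 = "The largest city in x is"
ids = tok(hop2, return_tensors="pt")
x_pos = 4                # assumed token index of " x" under the GPT-2 tokenizer
early_layer = 2          # assumed injection layer

def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > x_pos:                     # prompt pass only
        hidden[0, x_pos] = bridge
    return output

handle = model.transformer.h[early_layer].register_forward_hook(inject)
gen = model.generate(**ids, max_new_tokens=5, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(gen[0][ids.input_ids.shape[1]:]))  # ideally continues with "Tokyo"
```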