Cross-Model Patching
Training for Mapping
Explaining smaller models using larger models
If the two models' representation spaces differ, a mapping function such as a linear transformation matrix is learned from training data to reduce that difference; a sketch of such a mapping follows.
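A minimal sketch of learning that mapping, assuming paired activations from the two models have already been collected on the same prompts; the hidden sizes and the closed-form least-squares fit are illustrative assumptions, not a fixed recipe.

```python
import torch

# Hypothetical dimensions; H_src / H_tgt stand in for paired activations
# collected by running both models on the same training prompts.
n_examples, d_src, d_tgt = 10_000, 768, 4096
H_src = torch.randn(n_examples, d_src)          # e.g., small-model activations
H_tgt = torch.randn(n_examples, d_tgt)          # e.g., large-model activations

# W minimizes ||H_src @ W - H_tgt||^2, i.e., a linear map between the spaces.
W = torch.linalg.lstsq(H_src, H_tgt).solution   # shape (d_src, d_tgt)

def map_activation(h_src: torch.Tensor) -> torch.Tensor:
    """Project a small-model activation into the large model's space
    before patching it into the larger model."""
    return h_src @ W
```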
Expressive Decoding
A technique that, unlike SAEs, requires no training; it explains activation vectors in natural language instead of relying on only a few selected values the way LLM-based explainers do. It works with two prompts (a sketch follows the list):
- source prompt: the prompt from which the activation to be explained is extracted
- inspection prompt: the prompt into which that activation is patched so the model verbalizes it
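A minimal sketch of the source-prompt / inspection-prompt mechanic using Hugging Face transformers; the GPT-2 model choice, layer indices, few-shot inspection prompt, and placeholder position are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # illustrative choice only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# 1) Source prompt: grab the residual-stream activation of the token we want
#    to explain (here the last token) at an assumed layer.
source_prompt = "The Eiffel Tower is located in the city of Paris"
src_layer = 6                            # assumed layer
with torch.no_grad():
    out = model(**tok(source_prompt, return_tensors="pt"),
                output_hidden_states=True)
h = out.hidden_states[src_layer + 1][0, -1]   # hidden state after block src_layer

# 2) Inspection prompt: a few-shot "describe x" template ending in a
#    placeholder token whose activation we overwrite with h.
inspection_prompt = ("Syria: country in the Middle East. "
                     "Leonardo DiCaprio: American actor. x")
ins = tok(inspection_prompt, return_tensors="pt")
patch_pos = ins.input_ids.shape[1] - 1        # position of the placeholder "x"

def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > patch_pos:           # patch only on the prompt pass
        hidden[0, patch_pos] = h
    return output

target_layer = 6                              # assumed layer to patch into
handle = model.transformer.h[target_layer].register_forward_hook(patch_hook)  # GPT-2-style layout
gen = model.generate(**ins, max_new_tokens=10, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(gen[0][ins.input_ids.shape[1]:]))   # model's description of h
```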
Multi-hop patching
A fix for cases where each individual reasoning step is correct but the connection between them fails: the model's intermediate representation is extracted from a specific layer and patched into another layer so that the model can derive the correct answer.
For example, for the question "What is the largest city in the country where sushi originated?", the model must first recognize that sushi originated in Japan and then determine that Tokyo is Japan's largest city.
By extracting the intermediate hidden representation and injecting it into an appropriate layer, we can help the model continue its reasoning correctly (a minimal sketch follows).
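A minimal sketch of multi-hop patching on the sushi question; the GPT-2-style layer layout (model.transformer.h), the layer indices, and the placeholder token position are assumptions that would need tuning for a real model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Hop 1: let the model resolve the bridge entity ("the country where sushi
# originated") and keep the last-token hidden state at a late layer.
hop1 = "The country where sushi originated is"
with torch.no_grad():
    out = model(**tok(hop1, return_tensors="pt"), output_hidden_states=True)
late_layer = 10
bridge = out.hidden_states[late_layer + 1][0, -1]   # hopefully encodes "Japan"

# Hop 2: inject that representation early in a prompt for the second hop,
# at the position of the placeholder token " x".
hop2 = "The largest city in x is"
ids = tok(hop2, return_tensors="pt")
x_pos = 4                # assumed token index of " x" under the GPT-2 tokenizer
early_layer = 2          # assumed injection layer

def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > x_pos:                     # prompt pass only
        hidden[0, x_pos] = bridge
    return output

handle = model.transformer.h[early_layer].register_forward_hook(inject)
gen = model.generate(**ids, max_new_tokens=5, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(gen[0][ids.input_ids.shape[1]:]))  # ideally continues with "Tokyo"
```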