Activation Probing
Activation Patching for LLMs
Activation patching is a technique used to understand how different parts of a model contribute to its behavior. In an activation patching experiment, we modify or “patch” the activations of certain model components and observe the impact on model output.
One application is extracting steering vectors for Contrastive Activation Addition (CAA). A typical activation patching experiment involves three runs:
- Clean run
- Corrupted run
- Patched run
Example
- Clean input: “What city is the Eiffel Tower in?” → run the model and save the clean activations
- Corrupted input: “What city is the Colosseum in?” → run the model and note the corrupted output (e.g., “Rome”)
- Patched run: re-run the corrupted input with the clean activation patched in → observe which layer or attention head is important for producing the correct answer, “Paris”
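The three-run recipe above can be sketched with PyTorch forward hooks. The model here is a hypothetical toy MLP standing in for a transformer, and random tensors stand in for the two prompts; the hook mechanics (cache on the clean run, overwrite on the patched run) are the same for a real LLM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of layers whose intermediate
# activations we can cache and patch. (Hypothetical model, not a real LLM.)
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)

clean_input = torch.randn(1, 4)      # stands in for the "Eiffel Tower" prompt
corrupted_input = torch.randn(1, 4)  # stands in for the "Colosseum" prompt

cache = {}

def save_hook(module, inp, out):
    # Clean run: record this component's activation.
    cache["layer2"] = out.detach().clone()

def patch_hook(module, inp, out):
    # Patched run: overwrite this component's output with the clean activation.
    return cache["layer2"]

layer = model[2]  # the component being patched

# 1. Clean run: save the activation.
h = layer.register_forward_hook(save_hook)
clean_logits = model(clean_input)
h.remove()

# 2. Corrupted run: baseline output on the corrupted input.
corrupted_logits = model(corrupted_input)

# 3. Patched run: corrupted input, but with the clean activation at layer 2.
h = layer.register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
h.remove()

# Compare the three outputs to see how much clean behaviour the patch restores.
print("clean:", clean_logits)
print("corrupted:", corrupted_logits)
print("patched:", patched_logits)
```

Because everything downstream of the patched layer depends only on that activation in this toy model, the patched run exactly reproduces the clean output; in a real transformer, residual connections and other paths make the restoration partial, which is what makes per-component patching informative.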
Activation Patching Methods
Hands-on
Neel Nanda's method
Contrastive Activation Addition (CAA) steering vectors can behave unstably on specific inputs even within the training distribution, and generalize poorly to out-of-distribution (OOD) data.
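A minimal sketch of CAA steering-vector extraction: take the mean activation difference between contrastive prompt pairs at one layer, then add the scaled vector to the residual stream at inference. The activation tensors below are synthetic stand-ins; in practice they would be cached from forward passes over a real model.

```python
import torch

torch.manual_seed(0)

d_model = 16  # hypothetical residual-stream width

# Synthetic stand-ins for cached activations at one layer, for 32
# contrastive prompt pairs (desired vs. undesired behaviour).
pos_acts = torch.randn(32, d_model) + 1.0  # activations for the desired behaviour
neg_acts = torch.randn(32, d_model) - 1.0  # activations for the opposite behaviour

# CAA steering vector: mean activation difference across the pairs.
steering_vector = (pos_acts - neg_acts).mean(dim=0)

# At inference, add the scaled vector to the residual stream at that layer.
alpha = 2.0  # steering strength (a tunable hyperparameter)
resid = torch.randn(1, d_model)
steered_resid = resid + alpha * steering_vector
```

The instability noted above shows up here as sensitivity to the choice of layer, the prompt pairs used for the means, and the scale `alpha`.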
Linear probing
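Linear probing fits a linear classifier on cached activations to test whether a concept is linearly decodable from a layer. A toy sketch with synthetic activations; the planted "concept direction" is an assumption for illustration only.

```python
import torch

torch.manual_seed(0)

d_model, n = 16, 200

# Synthetic stand-in for activations labelled with a binary concept
# (e.g. statement true vs. false); in practice these are cached from a model.
labels = torch.randint(0, 2, (n,)).float()
direction = torch.randn(d_model)  # planted concept direction (assumption)
acts = torch.randn(n, d_model) + labels[:, None] * direction

# Linear probe: a single logistic-regression layer on the activations.
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=0.05)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    loss.backward()
    opt.step()

acc = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean().item()
print(f"probe train accuracy: {acc:.2f}")
```

High probe accuracy only shows the concept is linearly readable at that layer, not that the model causally uses it; activation patching is the complementary causal check.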
Runtime monitoring