Activation Patching

Creator: Seonglae Cho
Created: 2024 Oct 14 1:10
Edited: 2025 Oct 13 9:56

Activation Probing

Monkey patching for LLMs

Activation patching is a technique used to understand how different parts of a model contribute to its behavior. In an activation patching experiment, we modify or “patch” the activations of certain model components and observe the impact on model output.
For example, extracting a Steering Vector for CAA. The experiment consists of three runs:
  1. Clean run
  2. Corrupted run
  3. Patched run

Example

  1. Clean input “What city is the Eiffel Tower in?” → save the clean activations
  2. Corrupted input “What city is the Colosseum in?” → save the corrupted output
  3. Patch the corrupted run with the clean activations → observe which layer or attention head is important for producing the correct answer, “Paris”
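The three runs above can be sketched on a toy two-layer network. This is a hypothetical stand-in for a transformer, not a real model API: the weights, `forward` helper, and inputs below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network standing in for a transformer (hypothetical, not a real LLM)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def forward(x, patch_act=None):
    """Run the model; optionally replace the hidden activation (the 'patch')."""
    h = np.maximum(x @ W1, 0.0)  # hidden activation (ReLU)
    if patch_act is not None:
        h = patch_act            # activation patching: overwrite with the cached value
    return h @ W2, h

clean_x = rng.normal(size=(1, 4))    # stands in for the "Eiffel Tower" prompt
corrupt_x = rng.normal(size=(1, 4))  # stands in for the "Colosseum" prompt

clean_out, clean_h = forward(clean_x)         # 1. clean run: cache the activation
corrupt_out, _ = forward(corrupt_x)           # 2. corrupted run: baseline output
patched_out, _ = forward(corrupt_x, clean_h)  # 3. patched run: restore the clean activation

# Patching the whole hidden layer restores the clean output exactly, showing that
# this layer carries everything that differs between the two inputs.
print(np.allclose(patched_out, clean_out))  # True
```

In a real experiment the patch is applied per layer (and per attention head), and the component whose patch most restores the clean behavior is the one judged important.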
Activation Patching Methods

Hands-on

Activation Patching — nnsight

Neel Nanda's method

Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda
A write-up of an incomplete project I worked on at Anthropic in early 2022, using gradient-based approximation to make activation patching far more scalable
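The gradient-based shortcut can be illustrated on a toy setup: instead of rerunning the model once per patch, attribution patching approximates each patch's effect as the (clean − corrupt) activation difference dotted with the metric's gradient at the corrupted run. All names below are hypothetical; for a readout that is linear in the activation, the first-order approximation is exact.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy setup: a hidden activation h feeding a linear readout (hypothetical model)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def hidden(x):
    return np.maximum(x @ W1, 0.0)

def metric(h):
    # Scalar metric, e.g. the logit of the correct answer ("Paris")
    return (h @ W2)[0, 0]

clean_h = hidden(rng.normal(size=(1, 4)))
corrupt_h = hidden(rng.normal(size=(1, 4)))

# Exact activation patching: rerun with the clean activation and take the difference
exact_effect = metric(clean_h) - metric(corrupt_h)

# Attribution patching: first-order approximation,
#   delta_metric ≈ (clean_act - corrupt_act) · d(metric)/d(act)
# The readout is linear in h here, so the gradient is simply W2[:, 0]
grad = W2[:, 0]
approx_effect = ((clean_h - corrupt_h) @ grad).item()

print(np.isclose(exact_effect, approx_effect))  # True (exact because metric is linear in h)
```

The payoff is scale: one forward-plus-backward pass yields the gradient at every activation at once, so all candidate patches can be scored simultaneously rather than one rerun each.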
Contrastive Activation Addition steering vectors operate unstably for specific in-distribution inputs and generalize poorly to OOD data
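For context, a CAA-style steering vector is typically extracted as the mean activation difference over contrastive prompt pairs and added back to the residual stream at inference. This is a minimal sketch with synthetic activations; all arrays and the scale `alpha` are made up, and no real model is involved.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical cached activations at one layer for contrastive prompt pairs,
# e.g. positive-behavior vs negative-behavior completions (CAA-style)
pos_acts = rng.normal(loc=1.0, size=(16, 8))   # activations on positive prompts
neg_acts = rng.normal(loc=-1.0, size=(16, 8))  # activations on negative prompts

# CAA steering vector: mean difference of the contrastive activations
steering = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# At inference, add the (scaled) vector to the activation at that layer
alpha = 4.0                   # steering strength (arbitrary choice here)
h = rng.normal(size=(1, 8))   # activation for a new prompt
h_steered = h + alpha * steering

print(h_steered.shape)  # (1, 8)
```

The instability noted above shows up when a single mean-difference vector is added uniformly: the same `alpha` that steers most inputs can overshoot on particular in-distribution inputs and fail to transfer out of distribution.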

linear probing

runtime monitor
Runtime Monitoring Neuron Activation Patterns
For using neural networks in safety critical domains, it is important to know if a decision made by a neural network is supported by prior similarities in training. We propose runtime neuron...

Recommendations