Activation Patching

Creator
Seonglae Cho
Created
2024 Oct 14 1:10
Edited
2024 Oct 14 11:7

Extracting a Steering Vector for Contrastive Activation Addition
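
As a rough illustration of what a steering vector is, the sketch below assumes the common mean-difference formulation: average the activations collected on "positive" prompts, subtract the average over "negative" prompts, and add the scaled result back to an activation at inference time. The activation values, the helper names (`mean_vector`, `steering_vector`, `add_steering`), and the coefficient `alpha` are all hypothetical and chosen only for illustration. A real setup would read these activations from one layer of an actual model.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(pos_acts, neg_acts):
    """Mean-difference vector: mean(positive) - mean(negative)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def add_steering(activation, sv, alpha=1.0):
    """Add the scaled steering vector to a single activation vector."""
    return [a + alpha * s for a, s in zip(activation, sv)]


# Hypothetical activations from one layer (toy numbers, not real model data).
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

sv = steering_vector(pos_acts, neg_acts)
steered = add_steering([0.5, 0.5, 0.5], sv, alpha=0.5)

print(sv)       # [2.0, 0.0, -2.0]
print(steered)  # [1.5, 0.5, -0.5]
```

The sign and size of `alpha` control the direction and strength of the steering; here it is just a free parameter.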

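  1. Clean run: run the model on the clean prompt and cache its activations
  2. Corrupted run: run the model on the corrupted prompt and record its output as a baseline
  3. Patched run: rerun the corrupted prompt while overwriting one chosen activation with its cached clean value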

Example

  1. Clean input “What city is the Eiffel Tower in?” → save the clean activations
  2. Corrupted input “What city is the Colosseum in?” → save the corrupted output
  3. Rerun the corrupted input, patching in one clean activation at a time → observe which layers or attention heads restore the correct answer, “Paris”
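
The three runs can be sketched on a toy network. This is a minimal illustration under stated assumptions, not a real language model: the weights `W1` and `W2`, the two-dimensional inputs standing in for the Eiffel Tower and Colosseum prompts, and the "Paris logit" output are all made up. Each hidden unit plays the role of a layer or attention head, and the recovery score measures how far patching that unit moves the corrupted output back toward the clean one.

```python
def relu(vector):
    return [max(0.0, a) for a in vector]


def matvec(matrix, x):
    return [sum(w * a for w, a in zip(row, x)) for row in matrix]


# Hypothetical toy weights: 2 inputs -> 3 hidden units -> 1 "Paris" logit.
W1 = [[2.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W2 = [1.0, 1.0, -1.0]


def forward(x, patch=None):
    """Run the toy model; `patch` maps hidden-unit index -> value to overwrite."""
    hidden = relu(matvec(W1, x))
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    logit = sum(w * h for w, h in zip(W2, hidden))
    return logit, hidden


clean_x = [1.0, 0.0]    # stands in for the Eiffel Tower prompt
corrupt_x = [0.0, 1.0]  # stands in for the Colosseum prompt

# 1. Clean run: cache the hidden activations.
clean_logit, clean_hidden = forward(clean_x)
# 2. Corrupted run: record the baseline output.
corrupt_logit, _ = forward(corrupt_x)

# 3. Patched runs: overwrite one hidden unit at a time with its clean value.
recovery = []
for i in range(len(clean_hidden)):
    patched_logit, _ = forward(corrupt_x, patch={i: clean_hidden[i]})
    # 0.0 means no effect, 1.0 means the clean output is fully restored.
    recovery.append((patched_logit - corrupt_logit) / (clean_logit - corrupt_logit))

for i, score in enumerate(recovery):
    print(f"hidden unit {i}: recovery = {score:.2f}")
```

In this toy setup, unit 0 recovers about two thirds of the clean output, unit 2 about one third, and unit 1 none, since its activation is the same on both inputs. On a real model the same loop would run over layers, heads, or token positions, typically using forward hooks to read and overwrite activations.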
 
 
 

Neel Nanda's method

Steering vectors (SVs) from Contrastive Activation Addition behave unreliably on some in-distribution inputs and generalize poorly to out-of-distribution (OOD) data
 
 
