Input-centric methods (based on inputs that activate features) fail to reflect the impact of features on outputs. They have high computational costs and strong data dependencies. An ensemble approach is used that projects feature vectors into vocabulary space to analyze top tokens using both input and output, or analyzes tokens with large output probability changes during feature amplification. This provides high computational efficiency, and dead features can also be activated through output-centric methods.
VocabProj
Creator
Creator
Seonglae ChoCreated
Created
2025 Jan 16 15:7Editor
Editor
Seonglae ChoEdited
Edited
2025 Jan 16 17:16Refs
Refs