CorrSteer Method

Creator: Seonglae Cho
Created: 2025 Jan 16 17:16
Edited: 2025 Feb 18 14:45
Refs
linear hypothesis - correlation based feature analysis
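
As a rough illustration of what correlation-based feature analysis could look like, the sketch below ranks SAE features by the Pearson correlation between each feature's activation and a per-sample score. This is an assumption about the general idea, not the exact procedure used here; the names `pearson`, `rank_features`, `toy_activations`, and `toy_scores` are hypothetical, and the data is a toy example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(activations, scores):
    """Rank features by absolute correlation with the scores.

    activations: one row per sample, one column per (hypothetical) SAE feature.
    scores: one number per sample, e.g. a bias or task-correctness score.
    Returns a list of (feature_index, correlation), strongest first.
    """
    n_features = len(activations[0])
    corrs = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        corrs.append((j, pearson(column, scores)))
    return sorted(corrs, key=lambda item: abs(item[1]), reverse=True)


# Toy data (made up for illustration): 4 samples, 3 features.
toy_activations = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.5],
    [2.0, 1.0, 0.5],
    [3.0, 0.0, 0.5],
]
toy_scores = [0.0, 1.0, 2.0, 3.0]
ranking = rank_features(toy_activations, toy_scores)
```

In this toy case feature 0 tracks the score exactly, feature 1 is weakly anti-correlated, and the constant feature 2 gets a correlation of zero, so features would be considered for steering in that order.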
 
 
 
We choose to intervene on the highest-layer activations because stereotypes are highly context-dependent, and attention-sink behavior indicates that the higher layers hold the most context-dependent token representations.
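
A minimal sketch of such an intervention, assuming it is applied as additive steering (adding a scaled feature direction to the activation vector). The note does not state the exact form of the intervention, so the function `steer_top_layer`, its parameters, and the toy vectors are all hypothetical.

```python
def steer_top_layer(layer_activations, direction, coefficient):
    """Return a copy of the activations with steering applied to the last layer only.

    layer_activations: list of per-layer activation vectors (lists of floats).
    direction: steering vector, e.g. an SAE decoder direction (assumed).
    coefficient: scalar steering strength (assumed hyperparameter).
    """
    if len(layer_activations[-1]) != len(direction):
        raise ValueError("direction must match the activation dimension")
    # Copy every layer so the caller's data is left untouched.
    steered = [list(layer) for layer in layer_activations]
    # Only the highest layer is modified; lower layers stay as they were.
    steered[-1] = [h + coefficient * d for h, d in zip(steered[-1], direction)]
    return steered


# Toy example: 3 layers, 2-dimensional activations.
toy_layers = [[1.0, 2.0], [0.5, -0.5], [0.0, 1.0]]
toy_direction = [1.0, 0.0]
steered = steer_top_layer(toy_layers, toy_direction, coefficient=2.0)
```

Restricting the change to the last entry mirrors the choice above: lower layers are left alone, and only the most context-dependent representation is shifted.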
 
 

Attempted approach (fine-tuning the SAE)

The dataset itself appears to carry some bias: we could not determine why the fine-tuned SAE reinforced some stereotype distinctions while weakening others. The effect is not related to the number of data samples, so the explicitness of the bias is one candidate cause. Fine-tuning the SAE on a specific dataset does help in some specific areas, but we could not achieve a general improvement.
 
 
 

Recommendations