linear hypothesis - correlation-based feature analysis
We chose to intervene on the highest-layer activations because stereotypes are highly context-dependent, and the attention sink indicates that tokens in the higher layers are strongly context-dependent.
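As an illustration of what correlation-based feature analysis can look like, here is a minimal sketch, not the exact procedure used here. It assumes a matrix of SAE feature activations taken from the highest layer (one row per example) and a binary stereotype label per example. The function name `rank_features_by_correlation`, the array shapes, and the synthetic data are all hypothetical.

```python
import numpy as np

def rank_features_by_correlation(feature_acts, labels):
    """Pearson correlation between each feature column and a binary label.

    feature_acts: (n_examples, n_features) array of feature activations (assumed layout).
    labels: (n_examples,) array of 0/1 labels (assumed: 1 = stereotyped example).
    Returns (corr, order), where order lists feature indices by descending |corr|.
    """
    acts = np.asarray(feature_acts, dtype=float)
    y = np.asarray(labels, dtype=float)
    acts_c = acts - acts.mean(axis=0)   # center each feature
    y_c = y - y.mean()                  # center the label
    denom = np.sqrt((acts_c ** 2).sum(axis=0) * (y_c ** 2).sum())
    # Features (or labels) that never vary get correlation 0 instead of NaN.
    safe = np.where(denom > 0, denom, 1.0)
    corr = np.where(denom > 0, (acts_c * y_c[:, None]).sum(axis=0) / safe, 0.0)
    order = np.argsort(-np.abs(corr))
    return corr, order

# Synthetic stand-in data: 200 examples, 8 features, feature 3 tied to the label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
feature_acts = rng.random((200, 8))
feature_acts[:, 3] += 2.0 * labels

corr, order = rank_features_by_correlation(feature_acts, labels)
print("most label-correlated feature:", int(order[0]))
```

Under the linear hypothesis, features at the top of this ranking are the natural candidates for intervention. A high correlation alone does not show that a feature causally drives the stereotyped output.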
approach attempted (fine-tuning SAE)
The dataset itself may have some bias: we could not determine why the SAE reinforced some stereotype distinctions while weakening others. The effect is not related to the number of data samples, so the explicitness of the bias is one candidate cause. Fine-tuning the SAE on a specific dataset does help in some specific areas, but we could not achieve a general improvement.
Seonglae Cho