ReFAT

Creator
Seonglae Cho
Created
2025 Jan 18 0:24
Edited
2025 Mar 13 16:32

Refusal Feature Adversarial Training

Refusal feature ablation (RFA) approximates worst-case activation perturbations.

The authors observed significant performance degradation when the refusal direction was simply zeroed out, potentially because the resulting activations fall out of distribution.
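
A minimal sketch of what ablating a refusal direction could look like, using NumPy on synthetic activations. Everything here is an assumption for illustration: the difference-of-means estimate of the direction, the helper names `refusal_direction` and `ablate_direction`, and the toy data. The `target` argument shows the plain zeroing described above (`target=0.0`) next to one hypothetical alternative that resets the component to its mean value on harmless inputs, which keeps activations closer to values seen in distribution; the source does not confirm which variant ReFAT uses.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction, target=0.0):
    """Set the component of each activation along `direction` to `target`.

    target=0.0 zeroes the refusal component; any other value clamps the
    component to that value while leaving orthogonal components unchanged.
    """
    coeff = acts @ direction  # shape (n,): current component per activation
    return acts + np.outer(target - coeff, direction)


# Synthetic activations: harmful inputs are shifted along the first axis.
rng = np.random.default_rng(0)
dim = 16
harmless = rng.normal(size=(64, dim))
harmful = rng.normal(size=(64, dim)) + 3.0 * np.eye(dim)[0]

r_hat = refusal_direction(harmful, harmless)

# Variant 1: zero out the refusal component.
zeroed = ablate_direction(harmful, r_hat)

# Variant 2 (hypothetical): clamp to the mean harmless component instead.
harmless_mean = float((harmless @ r_hat).mean())
clamped = ablate_direction(harmful, r_hat, target=harmless_mean)
```

Both variants change only the component along `r_hat`; they differ in whether that component lands at zero or at a value typical of harmless inputs.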
 
 
