Refusal Feature Adversarial Training
RFA (refusal feature ablation) Approximates worst-cast activation Perturbations.
They observed significant performance degradation when the refusal direction was simply zeroed out, potentially due to the resulting out-of-distribution behavior