ReFAT

Creator

Seonglae Cho

Created

2025 Jan 18 0:24

Editor

Seonglae Cho

Edited

2025 Mar 13 16:32

Refs

Refusal Feature Adversarial Training

RFA (refusal feature ablation) approximates worst-case activation perturbations.

The authors observed significant performance degradation when the refusal direction was simply zeroed out, potentially because the resulting activations are out-of-distribution.
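A minimal sketch of what ablating a refusal direction can look like, using toy numbers. It assumes the refusal feature is the unit-normalized difference between mean activations on harmful and harmless prompts, which this note does not state. The helper names (`refusal_direction`, `ablate`) and the `target` parameter are hypothetical. The second variant, which resets the component to the harmless-prompt average instead of zero, is shown only as one plausible way to stay closer to in-distribution activations, not as the paper's confirmed procedure.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless). Assumed definition."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def ablate(h, r_hat, target=0.0):
    """Set the component of h along r_hat to `target` (0.0 means plain zero ablation)."""
    coeff = dot(h, r_hat)
    return [x - (coeff - target) * r for x, r in zip(h, r_hat)]

# Toy 3-d activations (made-up numbers, for illustration only).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
r_hat = refusal_direction(harmful, harmless)

h = [3.0, 1.0, 0.5]

# Variant 1: zero out the refusal component entirely.
zeroed = ablate(h, r_hat)

# Variant 2 (assumption): reset the component to the harmless-prompt average.
harmless_coeff = dot(mean_vector(harmless), r_hat)
recentred = ablate(h, r_hat, target=harmless_coeff)
```

In both variants the components of `h` orthogonal to `r_hat` are left untouched; only the coefficient along the refusal direction changes.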


https://arxiv.org/pdf/2409.20089
