Continuous-Adversarial Training


Perturbation

\delta_{t+1} = \delta_t + \alpha \cdot \text{sign} \left( \nabla_\delta \log f_\theta(\hat{y} \mid x + \delta_t) \right)
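A minimal PyTorch sketch of this signed-gradient ascent step on a continuous perturbation δ in embedding space, assuming a HuggingFace-style causal LM whose forward accepts `inputs_embeds` and `labels` (label positions set to -100 are ignored in the loss). The function and argument names, the step size `alpha`, and the final ε-ball projection are illustrative assumptions, not the source's exact setup:

```python
import torch

def perturb_step(model, input_embeds, target_labels, delta, alpha=1e-3, eps=0.05):
    """One signed-gradient ascent step on the continuous perturbation delta.

    Maximizes log f_theta(y_hat | x + delta): `target_labels` holds the token
    ids of the adversarial target y_hat, with -100 on all prompt positions.
    """
    delta = delta.detach().requires_grad_(True)
    out = model(inputs_embeds=input_embeds + delta, labels=target_labels)
    log_prob = -out.loss  # mean log-prob of y_hat; the mean rescaling does not change sign(grad)
    (grad,) = torch.autograd.grad(log_prob, delta)
    delta = delta + alpha * grad.sign()  # delta_{t+1} = delta_t + alpha * sign(grad)
    # Optional projection onto an l_inf eps-ball (a common choice, not part of the update above)
    return delta.clamp(-eps, eps).detach()
```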

CAT

\mathcal{L}_{\text{CAT}} = \underbrace{-\log f_\theta(y_\text{safe} \mid x + \delta)}_{\text{Toward Safe}} \underbrace{- \log \left( 1 - f_\theta(y_\text{unsafe} \mid x + \delta) \right)}_{\text{Away from Unsafe}} \underbrace{- \mathbb{E}_{(x, y) \in D_u} \log f_\theta(y \mid x)}_{\text{Utility Preservation}}
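A sketch of the three terms under the same assumed HuggingFace-style interface. The away term is written as a token-level unlikelihood loss −log(1 − p); the batch arguments (perturbed embeddings paired with the safe and unsafe responses, plus a clean utility batch from D_u) and all names are illustrative:

```python
import torch

def token_logprobs(model, input_embeds, labels):
    """Log-probabilities of the label tokens (positions with label -100 are masked out)."""
    logits = model(inputs_embeds=input_embeds).logits[:, :-1]
    tgt = labels[:, 1:]
    mask = tgt != -100
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, tgt.clamp(min=0).unsqueeze(-1)).squeeze(-1)[mask]

def cat_loss(model, adv_embeds_safe, safe_labels, adv_embeds_unsafe, unsafe_labels,
             clean_embeds, utility_labels):
    # Toward safe: cross-entropy on y_safe under the attacked input x + delta
    toward = -token_logprobs(model, adv_embeds_safe, safe_labels).mean()
    # Away from unsafe: unlikelihood -log(1 - f_theta(y_unsafe | x + delta)), per token
    p_unsafe = token_logprobs(model, adv_embeds_unsafe, unsafe_labels).exp()
    away = -torch.log1p(-p_unsafe.clamp(max=1 - 1e-6)).mean()
    # Utility preservation: plain cross-entropy on clean utility data (x, y) in D_u
    utility = -token_logprobs(model, clean_embeds, utility_labels).mean()
    return toward + away + utility
```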

CAPO (DPO)

Continuous-Adversarial Preference Optimization

\mathcal{L}_{\text{CAPO}} = \mathbb{E}_{(x, y, \hat{y}) \in \mathcal{D}} \left[ \ell_\beta \left( \log \frac{f_\theta(y \mid x + \delta(x, \hat{y}))}{f_{\theta_0}(y \mid x)} - \log \frac{f_\theta(\hat{y} \mid x + \delta(x, \hat{y}))}{f_{\theta_0}(\hat{y} \mid x)} \right) \right]
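A sketch of this objective with the DPO logistic loss ℓ_β(z) = −log σ(βz). It assumes the attack δ(x, ŷ) has already been folded into the perturbed prompt embeddings (the same perturbed prompt for both responses, as the formula indicates), that the reference f_{θ0} is a frozen copy scored on the clean prompt, and that all helper and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, input_embeds, labels):
    """Summed log-probability of the label tokens (-100 positions are masked out)."""
    logits = model(inputs_embeds=input_embeds).logits[:, :-1]
    tgt = labels[:, 1:]
    mask = tgt != -100
    logp = torch.log_softmax(logits, dim=-1).gather(-1, tgt.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (logp * mask).sum(-1)

def capo_loss(policy, ref,
              adv_embeds_y, adv_embeds_yhat,      # x + delta(x, y_hat), with y or y_hat appended
              clean_embeds_y, clean_embeds_yhat,  # clean x, with y or y_hat appended
              y_labels, yhat_labels, beta=0.1):
    # Policy log-ratios are computed on the attacked prompt ...
    y_adv = seq_logprob(policy, adv_embeds_y, y_labels)
    yhat_adv = seq_logprob(policy, adv_embeds_yhat, yhat_labels)
    # ... while the frozen reference f_{theta_0} sees the clean prompt
    with torch.no_grad():
        y_ref = seq_logprob(ref, clean_embeds_y, y_labels)
        yhat_ref = seq_logprob(ref, clean_embeds_yhat, yhat_labels)
    margin = (y_adv - y_ref) - (yhat_adv - yhat_ref)
    return -F.logsigmoid(beta * margin).mean()  # l_beta(z) = -log sigma(beta * z)
```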