Perturbation

$$\delta_{t+1} = \delta_t + \alpha \cdot \text{sign} \left( \nabla_\delta \log f_\theta(\hat{y} \mid x + \delta_t) \right)$$

CAT

$$\mathcal{L}_{\text{CAT}} = \underbrace{-\log f_\theta(y_\text{safe} \mid x + \delta)}_{\text{Toward Safe}} + \underbrace{\log f_\theta(y_\text{unsafe} \mid x + \delta)}_{\text{Away from Unsafe}} - \underbrace{\mathbb{E}_{(x, y) \in D_u} \log f_\theta(y \mid x)}_{\text{Utility Preservation}}$$

CAPO (DPO): Continuous-Adversarial Preference Optimization

$$\mathcal{L}_{\text{CAPO}} = \mathbb{E}_{(x, y, \hat{y}) \in \mathcal{D}} \left[ \ell_\beta \left( \log \frac{f_\theta(y \mid x + \delta(x, \hat{y}))}{f_{\theta_0}(y \mid x)} - \log \frac{f_\theta(\hat{y} \mid x + \delta(x, \hat{y}))}{f_{\theta_0}(\hat{y} \mid x)} \right) \right]$$

Code: Continuous-AdvTrain, sophie-xhonneux (updated 2024 Dec 14)
Paper: https://arxiv.org/pdf/2405.15589
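The three formulas above can be checked numerically with a small pure-Python sketch. It is an illustration under stated assumptions, not the repository's implementation: `toy_log_prob` is a hypothetical logistic stand-in for $\log f_\theta(\hat{y} \mid x + \delta)$, all function names are made up for this example, $\ell_\beta$ is assumed to be the DPO-style logistic loss $-\log \sigma(\beta h)$, and no projection of $\delta$ onto a norm ball is applied. The CAT loss is written so that minimizing it lowers the unsafe log-likelihood, which is what the "Away from Unsafe" label requires.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def toy_log_prob(w, x, delta):
    # Hypothetical stand-in for log f_theta(y_hat | x + delta):
    # log sigmoid(w . (x + delta)), with x and delta as embedding vectors.
    z = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))
    return log_sigmoid(z)


def signed_gradient_step(w, x, delta, alpha):
    # delta_{t+1} = delta_t + alpha * sign(grad_delta log f(y_hat | x + delta_t)).
    z = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))
    scale = math.exp(log_sigmoid(-z))  # d/dz log sigmoid(z) = sigmoid(-z)
    grad = [scale * wi for wi in w]
    return [di + alpha * sign(g) for di, g in zip(delta, grad)]


def cat_loss(logp_safe_adv, logp_unsafe_adv, logp_utility):
    # Toward-safe term, away-from-unsafe term, and utility term (batch mean).
    utility = sum(logp_utility) / len(logp_utility)
    return -logp_safe_adv + logp_unsafe_adv - utility


def capo_loss(logp_y_adv, logp_y_ref, logp_yhat_adv, logp_yhat_ref, beta):
    # h is the difference of log-ratios inside l_beta; the reference model
    # f_theta0 is evaluated on the clean input x, as in the formula.
    h = (logp_y_adv - logp_y_ref) - (logp_yhat_adv - logp_yhat_ref)
    return -log_sigmoid(beta * h)  # assumed DPO-style l_beta


# Run a few attack steps on the toy model.
w = [1.0, -2.0, 0.5]
x = [0.2, 0.1, -0.3]
delta = [0.0, 0.0, 0.0]
before = toy_log_prob(w, x, delta)
for _ in range(5):
    delta = signed_gradient_step(w, x, delta, alpha=0.1)
after = toy_log_prob(w, x, delta)
```

In this toy setup each step moves every coordinate of `delta` by `alpha` in the direction that raises the log-likelihood of $\hat{y}$, so `after` exceeds `before`. In a real model the gradient would come from automatic differentiation through the embedding layer rather than a closed form.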