DPO

Creator
Seonglae Cho
Created
2023 Sep 24 4:20
Editor
Edited
2025 Jul 2 14:46
Refs
RLHF
SFT
PPO
KTO

Direct Preference Optimization

DPO directly upweights the preferred response while downweighting the dispreferred one, which is a remarkably simple mechanism.
$$
L_{\mathrm{DPO}}(\pi_\theta,\pi_{\mathrm{ref}}) = \mathbb{E}_{(x,y_w,y_l)\sim D}\left[ -\log\sigma\!\left( \beta\left(\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} \right) \right) \right]
$$
It can learn human preferences without RL, using significantly less memory than the PPO architecture used in RLHF.
The model incorporates pairwise human preferences directly into training, using a dataset of preference pairs and the models' log-probabilities.
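A minimal PyTorch sketch of the loss above, assuming the summed per-sequence log-probabilities of the chosen and rejected responses have already been gathered from the policy and a frozen reference model; the function name and signature are illustrative, not the TRL API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from sequence log-probabilities, each of shape (batch,)."""
    # Log-ratios of policy to reference for the chosen (y_w) and rejected (y_l) responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # -log sigma(beta * (chosen log-ratio - rejected log-ratio)), averaged over the batch
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Only the policy receives gradients; the reference log-probabilities act as a fixed anchor, so a single backward pass through the policy suffices.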
 
https://arxiv.org/pdf/2402.01306
 
 

Implementation

Self-Rewarding Language Models

DPO Datasets

sDPO from Upstage

 
 

Recommendations