DPO

Creator
Seonglae Cho
Created
2023 Sep 24 4:20
Edited
2025 Jul 2 14:46
Refs
RLHF
SFT
PPO
KTO

Direct Preference Optimization

DPO literally upweights the preferred response while downweighting the dispreferred response, which is a very simple mechanism.
L_{\mathrm{DPO}}(\pi_\theta,\pi_{\mathrm{ref}}) = \mathbb{E}_{(x,y_w,y_l)\sim D}\Bigl[ -\log\sigma\!\Bigl( \beta\Bigl(\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} \Bigr) \Bigr) \Bigr]
It can learn human preferences without RL, using significantly less memory compared to the PPO architecture in RLHF.
The model directly incorporates users' pairwise preferences into training, using a set of preference pairs and the models' logits.
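A minimal PyTorch sketch of the loss above, assuming the per-response log-probabilities log π(y|x) have already been summed over response tokens for both the policy and the frozen reference model (function and argument names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the trainable policy against the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen - rejected)), averaged over the batch
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

Minimizing this pushes the policy's log-ratio up on the preferred response and down on the dispreferred one, exactly the upweight/downweight behavior described above.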
{'prompt': '<|im_start|>system\nYou are an AI assistant. You will be given a task. You must generate a detailed and long answer.<|im_end|>\n<|im_start|>user\nGenerate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One<|im_end|>\n<|im_start|>assistant\n', 'chosen': 'Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One.<|im_end|>\n', 'rejected': ' Sure! Here\'s a sentence that describes all the data you provided:\n\n"Midsummer House is a moderately priced Chinese restaurant with a customer rating of 3 out of 5, located near All Bar One, offering a variety of delicious dishes."<|im_end|>\n'}
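A record like the one above is scored by computing the summed log-probability of the chosen and rejected responses under both the policy and the reference model, which then feeds the loss sketch above. A hedged sketch, assuming a Hugging Face causal language model and tokenizer; `response_logprob` is a hypothetical helper, not a library API:

```python
import torch

@torch.no_grad()  # drop this decorator for the policy model during training
def response_logprob(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    # Assumes the prompt tokenizes identically as a prefix of prompt + response
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits                      # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # next-token log-probs
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Sum only over the response tokens (mask out the prompt portion)
    response_start = prompt_ids.shape[1] - 1
    return token_logprobs[:, response_start:].sum(dim=-1)
```

In practice, libraries such as Hugging Face TRL's DPOTrainer consume this {prompt, chosen, rejected} format directly and handle the tokenization, masking, and reference-model bookkeeping.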
 
https://arxiv.org/pdf/2402.01306
 
 

Implementation

Self-Rewarding Language Models

DPO Datasets

sDPO from Upstage

 
 

Recommendations