Direct Preference Optimization
DPO upweights the preferred response while downweighting the dispreferred one, which is a remarkably simple mechanism.
It learns human preferences without RL, using significantly less memory than the PPO pipeline in RLHF.
The model incorporates pairwise human preferences directly into training, using a dataset of preference pairs and the policy's log-probabilities relative to a frozen reference model.
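A minimal sketch of the DPO objective under these assumptions, in PyTorch; the tensor names (policy_chosen_logps, ref_rejected_logps, etc.) are hypothetical and stand for the summed per-token log-probabilities of each response given the prompt:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratio of chosen vs. rejected under the trained policy
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    # Same log-ratio under the frozen reference (SFT) model
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Implied reward margin relative to the reference, scaled by beta
    # and pushed through a logistic (sigmoid) loss
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```

Here beta controls how strongly the policy is kept close to the reference model; the DPO paper typically uses values around 0.1.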

Implementation
Self-Rewarding Language Models
DPO Datasets
sDPO from Upstage
Toxicity reduction interpretation
DPO reduces toxicity not through a few neurons, but via distributed activation shifts across all MLP neurons. DPO operates through the balanced action of four neuron groups (a grouping sketch follows the list):
- TP↓: Toxicity-aligned neuron, positive activation → decreased (toxicity suppression)
- TN↓: Toxicity-aligned neuron, negative activation → decreased
- AP↑: Anti-toxicity-aligned neuron, positive activation → increased (anti-toxicity reinforcement)
- AN↑: Anti-toxicity-aligned neuron, negative activation → increased (anti-toxicity reinforcement)
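A minimal sketch of how such a grouping could be computed, assuming a linear toxicity probe direction in the residual stream; the names here (w_out, acts, toxicity_dir) are hypothetical:

```python
import torch
import torch.nn.functional as F

def group_neurons(w_out: torch.Tensor,        # (n_neurons, d_model) MLP output weights
                  acts: torch.Tensor,          # (n_neurons,) activations on a given token
                  toxicity_dir: torch.Tensor   # (d_model,) linear toxicity probe direction
                  ) -> dict[str, torch.Tensor]:
    # A neuron is "toxicity-aligned" if its write vector points toward the probe direction
    align = F.cosine_similarity(w_out, toxicity_dir[None, :], dim=-1)
    toxic_aligned = align > 0
    positive_act = acts > 0
    return {
        "TP": toxic_aligned & positive_act,    # toxicity-aligned, positive activation
        "TN": toxic_aligned & ~positive_act,   # toxicity-aligned, negative activation
        "AP": ~toxic_aligned & positive_act,   # anti-toxicity-aligned, positive activation
        "AN": ~toxic_aligned & ~positive_act,  # anti-toxicity-aligned, negative activation
    }
```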
Patching the activations of all four groups to their post-DPO values reproduces or even exceeds the DPO effect, whereas patching only the toxic neurons has minimal effect. This contradicts the earlier interpretation that DPO merely suppresses a few toxic neurons.
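A minimal sketch of such an activation-patching experiment using PyTorch forward hooks; the module path and cached tensors are hypothetical (GPT-2-style HuggingFace naming is assumed):

```python
import torch

def make_patch_hook(post_dpo_acts: torch.Tensor, neuron_idx: torch.Tensor):
    """Forward hook that overwrites selected MLP neuron activations with
    values cached from the post-DPO model (activation patching)."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[..., neuron_idx] = post_dpo_acts[..., neuron_idx]
        return patched  # a returned value replaces the module's output
    return hook

# Hypothetical usage with a GPT-2-style HuggingFace model:
# idx = torch.cat([tp_idx, tn_idx, ap_idx, an_idx])  # indices of the four groups
# handle = model.transformer.h[layer].mlp.act.register_forward_hook(
#     make_patch_hook(cached_post_dpo_acts, idx))
# ...run the pre-DPO model on toxicity prompts and measure toxicity...
# handle.remove()
```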

Seonglae Cho