Direct Preference Optimization
DPO upweights the preferred response while downweighting the dispreferred one, which is a remarkably simple mechanism.
It learns human preferences without RL, using significantly less memory than the PPO pipeline in RLHF.
The model incorporates pairwise human preferences directly into training, using a dataset of preference pairs and the policy's log-probabilities relative to a frozen reference model.
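A minimal sketch of the DPO objective under these assumptions, in PyTorch; the tensor names (policy_chosen_logps, ref_rejected_logps, etc.) are hypothetical and stand for the summed per-token log-probabilities of each response given the prompt:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratio of chosen vs. rejected under the trained policy
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    # Same log-ratio under the frozen reference (SFT) model
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Implied reward margin relative to the reference, scaled by beta
    # and pushed through a logistic (sigmoid) loss
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```

Here beta controls how strongly the policy is kept close to the reference model; the DPO paper typically uses values around 0.1.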

Implementation
Self-Rewarding Language Models
DPO Datasets
sDPO from Upstage
Toxicity reduction interpretation
DPO reduces toxicity not through a few neurons, but via distributed activation shifts across all MLP neurons. DPO operates through the balanced action of four neuron groups (a grouping sketch follows the list):
- TP↓: Toxicity-aligned neuron, positive activation → decreased (toxicity suppression)
- TN↓: Toxicity-aligned neuron, negative activation → decreased
- AP↑: Anti-toxicity-aligned neuron, positive activation → increased (anti-toxicity reinforcement)
- AN↑: Anti-toxicity-aligned neuron, negative activation → increased (anti-toxicity reinforcement)
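A minimal sketch of how such a grouping could be computed, assuming a linear toxicity probe direction in the residual stream; the names here (w_out, acts, toxicity_dir) are hypothetical:

```python
import torch
import torch.nn.functional as F

def group_neurons(w_out: torch.Tensor,        # (n_neurons, d_model) MLP output weights
                  acts: torch.Tensor,          # (n_neurons,) activations on a given token
                  toxicity_dir: torch.Tensor   # (d_model,) linear toxicity probe direction
                  ) -> dict[str, torch.Tensor]:
    # A neuron is "toxicity-aligned" if its write vector points toward the probe direction
    align = F.cosine_similarity(w_out, toxicity_dir[None, :], dim=-1)
    toxic_aligned = align > 0
    positive_act = acts > 0
    return {
        "TP": toxic_aligned & positive_act,    # toxicity-aligned, positive activation
        "TN": toxic_aligned & ~positive_act,   # toxicity-aligned, negative activation
        "AP": ~toxic_aligned & positive_act,   # anti-toxicity-aligned, positive activation
        "AN": ~toxic_aligned & ~positive_act,  # anti-toxicity-aligned, negative activation
    }
```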
Patching the activations of all four groups to their post-DPO values reproduces or even exceeds the DPO effect, whereas patching only the toxic neurons has minimal effect. This contradicts the earlier interpretation that DPO merely suppresses a few toxic neurons.
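A minimal sketch of such an activation-patching experiment using PyTorch forward hooks; the module path and cached tensors are hypothetical (GPT-2-style HuggingFace naming is assumed):

```python
import torch

def make_patch_hook(post_dpo_acts: torch.Tensor, neuron_idx: torch.Tensor):
    """Forward hook that overwrites selected MLP neuron activations with
    values cached from the post-DPO model (activation patching)."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[..., neuron_idx] = post_dpo_acts[..., neuron_idx]
        return patched  # a returned value replaces the module's output
    return hook

# Hypothetical usage with a GPT-2-style HuggingFace model:
# idx = torch.cat([tp_idx, tn_idx, ap_idx, an_idx])  # indices of the four groups
# handle = model.transformer.h[layer].mlp.act.register_forward_hook(
#     make_patch_hook(cached_post_dpo_acts, idx))
# ...run the pre-DPO model on toxicity prompts and measure toxicity...
# handle.remove()
```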

Seonglae Cho