Direct Preference Optimization
In essence, it upweights the preferred response and downweights the dispreferred response, which is a very simple mechanism.
It learns human preferences without reinforcement learning and uses significantly less memory than the PPO-based pipeline in RLHF, since no separate reward model or critic has to be trained.
The model incorporates pairwise human preferences directly into training: each preference pair provides a preferred and a dispreferred response, and the loss is computed from the model's log-probabilities of each response relative to a frozen reference model.
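A minimal sketch of this loss is shown below, assuming per-example log-probabilities that have already been summed over the response tokens; the function and argument names are illustrative, not from any particular library.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument: tensor of sequence log-probabilities under the trainable
    # policy or the frozen reference model (assumed summed over response tokens).
    # Implicit rewards are the policy-vs-reference log-ratios, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: raises the likelihood of the
    # preferred response and lowers that of the dispreferred one, relative to
    # the reference model.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

In practice the two forward passes (policy and reference) are run on the same preference batch, and only the policy parameters receive gradients; the reference model stays frozen.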