SimPO: Simple Preference Optimization with a Reference-Free Reward
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to...
https://arxiv.org/abs/2405.14734