Iterative Reasoning Preference Optimization
This method iteratively samples candidate chain-of-thought responses, builds preference pairs by setting responses with correct answers against incorrect ones, and trains with a modified DPO loss that adds an NLL term on the winning response. Repeating this generate-then-train loop significantly improves reasoning accuracy without any additional human-annotated data.
arxiv.org
https://arxiv.org/pdf/2404.19733
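The training objective can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the sequence-level log-probability inputs, and the exact length normalization of the NLL term are assumptions; the paper combines a standard DPO term with an NLL term on the chosen (correct) response, weighted by a coefficient α.

```python
import math

def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 len_w, beta=0.1, alpha=1.0):
    """Sketch of a DPO + NLL loss for one preference pair.

    logp_w / logp_l: sequence log-probs of the winning (correct) and
        losing (incorrect) chain-of-thought under the current policy.
    ref_logp_w / ref_logp_l: the same sequences under the frozen
        reference model.
    len_w: token length of the winning response (assumed
        normalization for the NLL term).
    """
    # DPO margin: implicit reward gap between winner and loser
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)
    # NLL term keeps the likelihood of the correct response high,
    # counteracting DPO's tendency to lower both log-probs
    nll = -logp_w / len_w
    return dpo + alpha * nll
```

With a zero margin the DPO term reduces to log 2, and the NLL term alone drives the loss; as the policy separates the pair, the DPO term shrinks toward zero.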

Seonglae Cho