Iterative Reasoning Preference Optimization
Over several iterations, this method samples candidate chain-of-thought responses for the training prompts, builds preference pairs by pairing responses with correct final answers against incorrect ones, and trains on those pairs with a DPO loss augmented by a negative log-likelihood (NLL) term on the winning response. Because the preference data is self-generated, this substantially improves reasoning accuracy without requiring any new data beyond the original training prompts and gold answers.
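As a rough illustration of the training objective, the sketch below combines the standard DPO loss with a length-normalized NLL term on the chosen (correct) response. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: the argument names (`policy_chosen_logps`, `ref_chosen_logps`, `chosen_lengths`, etc.) and the default values of `beta` and `alpha` are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_nll_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_lengths, beta=0.1, alpha=1.0):
    """DPO loss plus an NLL term on the chosen (correct) responses.

    Each *_logps tensor holds summed token log-probabilities per sequence,
    shape (batch,). chosen_lengths holds the token count of each chosen
    response, used to length-normalize the NLL term. beta and alpha are
    illustrative hyperparameters, not values from the paper.
    """
    # Standard DPO term: log-ratio margin between chosen and rejected
    # completions under the policy vs. the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))

    # NLL regularizer: keeps pushing up the likelihood of the correct
    # chain of thought, not just its margin over the rejected one.
    nll = -policy_chosen_logps / chosen_lengths

    return (dpo + alpha * nll).mean()
```

The NLL term matters because DPO alone only widens the gap between chosen and rejected responses; it can do so while the absolute likelihood of the correct reasoning drifts downward, which the added term counteracts.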