IRPO

Created: 2024 May 7 11:41
Creator: Seonglae Cho
Edited: 2024 May 7 11:42

Iterative Reasoning Preference Optimization

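In each iteration, the method samples several candidate chains of thought per prompt, forms preference pairs in which a response that reaches the correct final answer is preferred over one that does not, and trains on those pairs with a modified DPO loss that adds a negative log-likelihood (NLL) term on the preferred response. The newly trained model then generates the samples for the next iteration. The approach improves reasoning accuracy using only the existing training prompts and their reference answers, without the need for additional data.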
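The two steps described above, pair construction and the combined DPO + NLL objective, can be sketched in plain Python for a single prompt. This is a minimal illustration, not the reference implementation: the function names `build_preference_pairs` and `dpo_nll_loss`, the defaults for `beta` and `alpha`, the exact-match correctness check, and the choice to length-normalize the NLL term by the winner's token count are all assumptions made for the example.

```python
import math
from itertools import product


def build_preference_pairs(samples, gold_answer):
    """Pair every correct sample (winner) with every incorrect one (loser).

    samples: list of (chain_of_thought, final_answer) tuples for one prompt.
    Correctness is judged by exact match on the final answer (an assumption).
    """
    winners = [s for s in samples if s[1] == gold_answer]
    losers = [s for s in samples if s[1] != gold_answer]
    return list(product(winners, losers))


def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, len_w,
                 beta=0.1, alpha=1.0):
    """Combined loss for one (winner, loser) pair.

    logp_*     : sequence log-probabilities under the model being trained
    ref_logp_* : sequence log-probabilities under the frozen reference model
    len_w      : token count of the winner, used to normalize the NLL term
    beta, alpha: hypothetical hyperparameter defaults
    """
    # DPO margin: how much more the policy prefers the winner than the
    # reference does, scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin >= 0:
        dpo = math.log1p(math.exp(-margin))
    else:
        dpo = -margin + math.log1p(math.exp(margin))
    # Length-normalized negative log-likelihood of the winning sequence.
    nll = -logp_w / len_w
    return nll + alpha * dpo


# Toy usage: three sampled responses, two of which reach the gold answer "4".
samples = [("2 + 2 = 4", "4"), ("2 + 2 = 5", "5"), ("two pairs make 4", "4")]
pairs = build_preference_pairs(samples, gold_answer="4")
loss = dpo_nll_loss(logp_w=-10.0, logp_l=-12.0,
                    ref_logp_w=-10.0, ref_logp_l=-12.0, len_w=5)
```

In a real training loop the log-probabilities would come from the current model and a frozen copy of the previous iteration's model, and the loss would be averaged over a batch of pairs. The NLL term keeps the likelihood of the winning response from drifting down while the DPO term widens the margin between winner and loser.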
