SWiRL

Creator
Seonglae Cho
Created
2025 Apr 27 1:0
Edited
2025 Apr 27 1:4

Step-wise reasonableness as reward

Uses no additional reward signals such as final-answer matching, number of steps, or API call costs
Combines synthetic data generation with an RL methodology that targets multi-step optimization scenarios
However, includes no experiment comparing against single-step RL with a verifiable reward
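
A minimal sketch of what a step-wise reward could look like, assuming each step of a multi-step trajectory is scored on its own by some judge. The names `Step`, `stepwise_rewards`, and `toy_judge` are hypothetical and not taken from the SWiRL work; `toy_judge` is only a stand-in heuristic so the example runs. Note that no gold final answer, step count, or API cost enters the reward.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    context: str  # everything visible before acting (prompt plus prior steps)
    action: str   # the output at this step (reasoning, tool call, or answer)


# Hypothetical judge signature: (context, action) -> reasonableness score.
Judge = Callable[[str, str], float]


def stepwise_rewards(trajectory: Sequence[Step], judge: Judge) -> List[float]:
    """Score every step independently; no final-answer reference is used."""
    return [judge(step.context, step.action) for step in trajectory]


def toy_judge(context: str, action: str) -> float:
    """Placeholder heuristic (an assumption, not a real reward model):
    an empty action or one that repeats something already in the context
    is treated as unreasonable."""
    action = action.strip()
    if not action or action in context:
        return 0.0
    return 1.0


question = "Q: capital of France?"
first_call = "search('capital of France')"
history = question + "\n" + first_call + "\n-> Paris"

trajectory = [
    Step(context=question, action=first_call),      # new, useful tool call
    Step(context=history, action=first_call),       # redundant repeat
    Step(context=history, action="Answer: Paris"),  # new final step
]

rewards = stepwise_rewards(trajectory, toy_judge)
```

In this toy run the redundant second call is the only step that receives a zero reward, which illustrates how a per-step signal can penalize a poor intermediate action even when the final answer is fine.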
