SWiRL

Creator

Created

2025 Apr 27 1:0

Editor

Edited

2025 Apr 27 1:4

Refs

Uses no additional rewards such as final-answer matching, number of steps, or API call costs

synthetic data generation and RL methodology targeting multi-step optimization scenarios

but no experiment comparing against single-step RL with a verifiable reward
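To make "multi-step optimization" concrete, here is a minimal sketch of one way a multi-step trajectory could be turned into per-step training samples, each scored on its own rather than by final-answer matching. This is an illustration under assumptions, not the paper's implementation: `Step`, `split_into_step_samples`, `score_steps`, and the `judge` callable are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str       # e.g. a tool call or a reasoning step (hypothetical format)
    observation: str  # environment response to that action


def split_into_step_samples(prompt: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Return one (context, action) pair per step.

    The context for step i is the prompt plus all earlier actions and
    observations, so each action can be optimized given what came before it.
    """
    samples = []
    context = prompt
    for step in steps:
        samples.append((context, step.action))
        context = context + "\n" + step.action + "\n" + step.observation
    return samples


def score_steps(samples: List[Tuple[str, str]],
                judge: Callable[[str, str], float]) -> List[float]:
    """Score each step independently with a caller-supplied judge.

    Assumption: the judge looks only at (context, action); no final-answer
    matching, step-count penalty, or API-cost term is involved.
    """
    return [judge(context, action) for context, action in samples]


if __name__ == "__main__":
    trajectory = [Step("search(capital of France)", "Paris"),
                  Step("answer: Paris", "")]
    samples = split_into_step_samples("Q: What is the capital of France?", trajectory)
    # Toy judge for demonstration only: rewards any non-empty action.
    print(score_steps(samples, lambda ctx, act: 1.0 if act else 0.0))
```

A single-step RL baseline with a verifiable reward would instead score only the final action against a known answer, which is the comparison the note above says is missing.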

///////