Step-wise reasonableness as reward
without any additional reward terms such as final-answer matching, number of steps, or API call costs
synthetic data generation and RL methodology targeting multi-step optimization scenarios
but no experiment comparing against single-step RL with verifiable rewards
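
To make the reward design in these notes concrete, here is a minimal sketch of a reward that scores each step for reasonableness and deliberately has no term for final-answer matching, step count, or API cost. Everything named here is hypothetical and not taken from the paper: `Step`, `stepwise_rewards`, `discounted_returns`, and `toy_judge` are illustrative names, the judge is a stand-in for whatever scorer the authors use, and turning per-step rewards into discounted returns is an assumption about how such rewards would typically feed an RL update.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a multi-step trajectory (hypothetical structure)."""
    context: str  # what the policy saw before acting
    action: str   # what the policy did at this step


def stepwise_rewards(
    trajectory: List[Step],
    judge: Callable[[Step], float],
) -> List[float]:
    """Score every step on its own, clamped to [0, 1].

    There is intentionally no term for final-answer matching,
    number of steps, or API call costs.
    """
    return [min(1.0, max(0.0, judge(step))) for step in trajectory]


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Standard reward-to-go: G_t = r_t + gamma * G_{t+1}.

    Assumption: the notes do not say how per-step rewards are aggregated.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def toy_judge(step: Step) -> float:
    """Placeholder judge: any non-empty action counts as reasonable."""
    return 1.0 if step.action.strip() else 0.0


trajectory = [
    Step(context="user question", action="search docs"),
    Step(context="search results", action=""),
    Step(context="retrieved passage", action="draft answer"),
]
rewards = stepwise_rewards(trajectory, toy_judge)
returns = discounted_returns(rewards, gamma=0.5)
```

Because the reward depends only on individual steps, a trajectory can score well without reaching a correct final answer, which is why the missing comparison against single-step RL with verifiable rewards matters.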