Chain-of-Thought (CoT) length does not follow a "longer is better" pattern; instead, accuracy traces an inverted U-curve, rising with length up to a point and then declining. This implies that an optimal CoT length exists.
If the CoT is too short (underthinking), a complex problem cannot be properly decomposed into subproblems; if it is too long (overthinking), step-level errors accumulate and performance drops.
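This tradeoff can be sketched with a toy model (my own illustrative assumption, not from any cited paper): treat accuracy as the product of a decomposition term that rises with length and an error-free term that decays with length. Their product yields the inverted U-curve, and its argmax is the "optimal length." The functional forms and constants (`k`, `eps`) are arbitrary choices for illustration.

```python
import math

def p_correct(length: int, k: float = 5.0, eps: float = 0.02) -> float:
    """Toy model of CoT accuracy vs. length (illustrative assumption only)."""
    # Probability the problem is sufficiently decomposed: rises with length.
    p_decompose = 1 - math.exp(-length / k)
    # Probability no step-level error occurs: each step fails independently
    # with probability eps, so this term decays with length.
    p_no_error = (1 - eps) ** length
    return p_decompose * p_no_error

# The product rises, peaks, then falls: an inverted U-curve with an
# interior optimum, matching the underthinking/overthinking tradeoff.
best_length = max(range(1, 101), key=p_correct)
```

Under these assumptions, the optimum sits strictly between the shortest and longest lengths: too few steps and decomposition fails; too many and accumulated error dominates.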
During RL training (e.g., with GRPO or PPO), the average CoT length tends to shrink over the course of training: the reward-maximization process converges toward the optimal length, revealing a simplicity bias.