In Reasoning Models, increasing test-time computations (thinking tokens) doesn't always lead to improvement, and reverse scaling where accuracy actually decreases has been observed across multiple tasks
arxiv.org
https://arxiv.org/pdf/2507.14417
Seonglae Cho
Seonglae Cho