AI Overthinking

Creator: Seonglae Cho
Created: 2025 Jul 19 23:46
Edited: 2026 Mar 11 1:48

AI Reasoning Length, Optimal Reasoning Length

Benchmark

OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Users employ Large Language Models (LLMs) for a diverse array of tasks, ranging from answering simple factual queries to writing code or solving difficult math problems. However, for a long time, LLMs struggled with the underthinking problem wherein they could generate fluent text and answer simple queries but their performance would often fall short when tackling challenging reasoning problems that required step-by-step thinking (Wei et al., 2022). This situation has improved drastically in the past year as an emerging class of thinking models has shown remarkable performance on these more complex tasks (DeepSeek-AI et al., 2025; OpenAI et al., 2024). Although increased thinking has generally improved domains such as math and code (Muennighoff et al., 2025; Aggarwal and Welleck, 2025), its benefit for simpler queries is limited and can sometimes even lead to performance degradation (Cuadron et al., 2025; Chen et al., 2025a; Gema et al., 2025). Beyond diminishing gains, this phenomenon of overthinking in simple tasks also introduces significant latency, thus increasing the total cost of API-based models and affecting the user experience.

When More is Less

Chain-of-Thought (CoT) length is not a case of "longer is better"; accuracy follows an inverted U-curve, initially increasing and then decreasing past a certain length. This implies an optimal length exists.
If the chain is too short (underthinking), complex aspects cannot be properly decomposed; if too long (overthinking), cumulative errors grow and performance drops. During RL training (e.g., GRPO, PPO), the average CoT length naturally converges toward shorter outputs → reward maximization finds the optimal length, revealing a simplicity bias.
arxiv.org
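The inverted U-curve can be illustrated with a toy fit: given (CoT length, accuracy) pairs, a quadratic in log-length locates the peak, i.e., the optimal length. A minimal sketch with synthetic numbers (not measurements from any paper):

```python
import numpy as np

# Toy illustration of the inverted U-curve: accuracy rises with CoT length,
# peaks, then falls. These data points are synthetic stand-ins.
lengths = np.array([64, 128, 256, 512, 1024, 2048, 4096], dtype=float)
accuracy = np.array([0.42, 0.55, 0.66, 0.71, 0.69, 0.61, 0.50])

# Fit a quadratic in log2(length) and locate its maximum.
x = np.log2(lengths)
a, b, c = np.polyfit(x, accuracy, deg=2)   # accuracy ≈ a*x^2 + b*x + c
x_opt = -b / (2 * a)                       # vertex of the parabola (a < 0)
optimal_length = 2 ** x_opt

print(f"estimated optimal CoT length ≈ {optimal_length:.0f} tokens")
```

With a concave fit (a < 0) the vertex gives the length at which accuracy is maximized; too far left is underthinking, too far right is overthinking.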

Manifold Steering

LLM overthinking lives in a low-dimensional manifold of the activation space, and by aligning with and intervening along it, token counts can be significantly reduced while maintaining accuracy. Manifold Steering: estimate the low-dimensional subspace of reasoning activations with PCA and steer only along it. Overthinking is not a single direction but a phenomenon bound to a low-dimensional manifold. Results: token reductions of up to ~71% across math, code, and QA tasks, with accuracy maintained or slightly improved.
arxiv.org
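A minimal sketch of the steering idea, not the paper's exact implementation: assuming hidden-state vectors have already been collected at one layer (random stand-ins here), PCA via SVD estimates the low-dimensional subspace, and the intervention removes only the component of a hidden state that lies inside it.

```python
import numpy as np

# Hedged sketch of manifold steering. `acts` stands in for hidden states
# collected at one layer during overthinking (n_samples x d_model).
rng = np.random.default_rng(0)
d_model, k = 768, 8            # k: assumed manifold dimension
acts = rng.normal(size=(512, d_model))

# PCA via SVD: the top-k right singular vectors span the estimated
# low-dimensional "overthinking manifold" of the activation space.
mu = acts.mean(axis=0)
_, _, vt = np.linalg.svd(acts - mu, full_matrices=False)
basis = vt[:k]                 # (k, d_model) orthonormal rows

def steer(h, alpha=1.0):
    """Remove (alpha-scaled) only the component of h inside the manifold."""
    coords = basis @ (h - mu)          # coordinates in the k-dim subspace
    return h - alpha * (basis.T @ coords)

h = rng.normal(size=d_model)
h_steered = steer(h)
# After steering with alpha=1, h has (numerically) no component left
# in the estimated subspace; all other directions are untouched.
print(np.abs(basis @ (h_steered - mu)).max())
```

Steering only along the estimated subspace, rather than a single direction, is what the note above means by the manifold (not one vector) being the locus of overthinking.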

Hot Mess of AI

When AI fails, it's more likely to fail as an inconsistent "hot mess" rather than as a dangerous agent consistently pursuing the wrong goal. Model errors can be decomposed into Bias (consistently wrong in the same way → systematic misalignment) and Variance (wrong in different ways each time → incoherent confusion). The proportion of variance in the errors is defined as an incoherence metric. The longer the reasoning and the more difficult the task, the more incoherent the errors become: the more thinking or actions taken, the more random the failures. AI Overthinking significantly increases this incoherence.
The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?
When AI systems fail, will they fail by systematically pursuing the wrong goals, or by being a hot mess? We decompose the errors of frontier reasoning models into bias (systematic) and variance (incoherent) components and find that, as tasks get harder and reasoning gets longer, model failures become increasingly dominated by incoherence rather than systematic misalignment.
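The bias/variance decomposition behind the incoherence metric can be sketched as follows. The per-task errors here are synthetic stand-ins for repeated model attempts, not real model runs:

```python
import numpy as np

# Hedged sketch of the error decomposition: for each task, k repeated
# attempts; errors[i, j] is the signed error of attempt j on task i.
rng = np.random.default_rng(1)
n_tasks, k = 100, 16

# A "hot mess" model: small systematic bias, large run-to-run variance.
bias_true, noise = 0.2, 1.0
errors = bias_true + noise * rng.normal(size=(n_tasks, k))

mean_err = errors.mean(axis=1)              # per-task bias estimate
bias2 = (mean_err ** 2).mean()              # systematic (bias) component
var = errors.var(axis=1, ddof=1).mean()     # incoherent (variance) component

# Incoherence: fraction of mean squared error explained by variance.
incoherence = var / (bias2 + var)
print(f"incoherence ≈ {incoherence:.2f}")   # near 1 → hot mess, not misalignment
```

An incoherence near 1 means failures are dominated by run-to-run randomness; near 0, by a consistent systematic error — the distinction the summary above draws between a hot mess and a misaligned agent.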

Think Deep, Not Just Long

Deep-Thinking Ratio (DTR): At each generation step t, the hidden state of every layer l is unembedded directly to produce an intermediate next-token distribution, and its Jensen–Shannon divergence (JSD) from the final-layer distribution is computed. The settling depth c_t is the first layer at which this JSD drops below a threshold and stays below it for all deeper layers, i.e., the depth at which the token's prediction has converged. If a token settles only in the deep layers (e.g., the top 15%), it is counted as a deep-thinking token, and the DTR is the fraction of deep-thinking tokens among all generated tokens. The correlation between output length and accuracy is on average negative (longer outputs are more likely to be incorrect).
arxiv.org
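A hedged sketch of computing DTR: the per-layer next-token distributions below are synthetic softmax outputs, and the JSD threshold and layer count are assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, vocab, n_tokens, eps = 32, 50, 200, 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def settling_depth(dists):
    """First layer whose JSD to the final layer stays below eps thereafter."""
    final = dists[-1]
    below = [jsd(p, final) < eps for p in dists]
    for l in range(len(below)):
        if all(below[l:]):
            return l
    return len(dists) - 1   # final layer always settles (JSD = 0)

depths = []
for _ in range(n_tokens):
    # Simulate layer-wise distributions drifting toward the final prediction.
    final_logits = 3.0 * rng.normal(size=vocab)
    speed = rng.uniform(0.5, 3.0)                  # per-token convergence speed
    dists = [softmax((l / (n_layers - 1)) ** speed * final_logits
                     + (1 - l / (n_layers - 1)) * rng.normal(size=vocab))
             for l in range(n_layers)]
    depths.append(settling_depth(dists))

# A token that settles only in the deepest 15% of layers is "deep-thinking".
deep_cut = int(np.ceil(0.85 * n_layers))
dtr = np.mean([d >= deep_cut for d in depths])
print(f"Deep-Thinking Ratio = {dtr:.2f}")
```

With real models, `dists` would come from unembedding each layer's hidden state (a logit-lens-style readout) rather than from simulation.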

Recommendations