Reasoning Model

Creator: Seonglae Cho
Created: 2024 Dec 21 0:30
Edited: 2026 Jan 11 0:04

Test-time Compute, Inference-time Compute

  • Pretraining Scaling is reaching its limits due to finite data.
Test-time compute matters because solving a problem requires computation proportional to the complexity of the underlying algorithm; answering immediately amounts to reciting memorized information. Since some human requests are genuinely novel queries, the model first needs to decide how much thinking a problem warrants based on its complexity. This allocation can itself be done through pattern matching over problems, which also enables extrapolation of intelligence.
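A minimal sketch of this idea, assuming a hypothetical difficulty estimator that maps a query to a reasoning-token budget (`estimate_difficulty` and `thinking_budget` are illustrative names, not from any specific system):

```python
# Hypothetical sketch: allocate test-time compute in proportion to estimated
# problem complexity, instead of spending a fixed budget on every query.

def estimate_difficulty(query: str) -> float:
    """Toy difficulty heuristic (a stand-in for a learned classifier):
    longer, multi-step-looking queries get a higher score in [0, 1]."""
    markers = ["prove", "derive", "step", "optimize", "why"]
    return min(1.0, len(query) / 500 + 0.2 * sum(m in query.lower() for m in markers))

def thinking_budget(query: str, min_tokens: int = 64, max_tokens: int = 4096) -> int:
    """Map estimated difficulty to a chain-of-thought token budget."""
    d = estimate_difficulty(query)
    return int(min_tokens + d * (max_tokens - min_tokens))

print(thinking_budget("What is 2 + 2?"))                            # small budget
print(thinking_budget("Prove the AM-GM inequality and derive the equality condition."))  # larger budget
```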
Test-time Compute Notion
 
 
AI Reasoning Methods
 
 
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
 
 
 
 
 
Pre-training, mid-training, and RL post-training each contribute to "reasoning capability" in different ways, and these contributions can be causally decomposed in a fully controlled (synthetic) setting. The approach uses DAGs to create reasoning structures, with templates (e.g., animal-zoo, teacher-school) that vary only the surface context. Evaluation is process-verified: the model's output CoT is parsed and the predicted graph is compared against the ground-truth graph.
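A minimal sketch of such a process-verified check, assuming the CoT has already been parsed into predicted dependency edges (the parser itself is task-specific and omitted; the function name and the toy DAG below are illustrative):

```python
# Sketch: compare the dependency DAG implied by a model's chain-of-thought
# against the ground-truth reasoning DAG, instead of checking only the final answer.

def edge_metrics(gold_edges: set[tuple[str, str]],
                 pred_edges: set[tuple[str, str]]) -> dict:
    """Precision/recall/F1 over directed edges of the reasoning graph."""
    tp = len(gold_edges & pred_edges)
    precision = tp / len(pred_edges) if pred_edges else 0.0
    recall = tp / len(gold_edges) if gold_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "process_correct": gold_edges == pred_edges}

# Ground-truth DAG for a toy "animal-zoo"-style template: each edge means
# "the left variable must be derived before the right one".
gold = {("num_lions", "num_cats"), ("num_cats", "total_animals")}
pred = {("num_lions", "num_cats"), ("num_lions", "total_animals")}  # parsed from the CoT

print(edge_metrics(gold, pred))
```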
Generalization types
  • Extrapolative (Depth): Can the model solve problems when the number of operations (reasoning steps) goes deeper than the training range?
  • Contextual (Breadth): Does the same reasoning structure transfer when surface context changes?
The conditions under which RL produces a 'true capability gain' are narrow. For in-distribution (ID) problems already well covered by pre-training, RL increases pass@1 but barely improves pass@128 → this is closer to "sharpening" existing capabilities. Conversely, when pre-training leaves headroom and the RL data targets the edge of competence (where pass@1 fails but pass@k shows some success), substantial expansion occurs, with pass@128 improving even on OOD. RL on problems that are too easy (ID) or too hard (complete OOD-hard) does not train well. In other words, easy problems only get sharpened, while problems at the edge of competence yield genuine capability gains.
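For reference, pass@k is usually estimated with the unbiased estimator from the HumanEval paper: with n samples per problem and c correct, pass@k = 1 − C(n−c, k)/C(n, k). A short sketch (the example numbers are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    (drawn without replacement from n generations) is correct, given c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 128 samples with 3 correct: pass@1 is low but pass@128 is already 1.0,
# the kind of gap that RL "sharpening" closes without expanding pass@128.
print(pass_at_k(128, 3, 1), pass_at_k(128, 3, 128))
```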
Contextual generalization requires a 'minimal seed'. For a new long-tail context (B), if pre-training exposure is only 0%–0.1%, RL struggles to transfer. However, if context B is included in pre-training even sparsely (e.g., ≥1%) at the level of atomic primitives, RL amplifies that seed into strong transfer (large pass@128 gains). RL struggles to create something from nothing but excels at scaling up small foundations (seeds).
Mid-training is the 'bridge' that significantly affects RL efficiency. When varying the ratio (β) between mid-training and RL under the same compute budget, the two regimes diverge: performance on OOD-edge (a moderately harder domain), especially pass@1, is better when the mid-training proportion is high and RL is light, whereas generalization to OOD-hard (a much harder domain) improves as the RL proportion is increased. Mid-training lays down priors/representations; RL expands exploration and composition on top of them.
PRM reduces AI Reward Hacking and improves performance. Using only outcome rewards (correctness) easily leads to "shortcuts/dishonest reasoning that just gets the answer right." Mixing in dense rewards based on process verification (an α mix), or granting the outcome reward only when the process is correct, reduces structural errors (dependency mismatches, etc.) and improves pass@1, and some of pass@128, on OOD.
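A minimal sketch of the two reward schemes described above; `alpha`, the verifier scores, and the function names are assumptions for illustration:

```python
# Sketch: combine a sparse outcome reward with a dense process-verification reward.

def mixed_reward(outcome_correct: bool, process_score: float, alpha: float = 0.5) -> float:
    """Blend outcome and process signals: r = (1 - alpha) * outcome + alpha * process.
    process_score in [0, 1] comes from a verifier that checks the parsed reasoning graph."""
    return (1 - alpha) * float(outcome_correct) + alpha * process_score

def gated_reward(outcome_correct: bool, process_correct: bool) -> float:
    """Pay the outcome reward only when the reasoning process itself verifies,
    closing the 'right answer via a shortcut' loophole."""
    return float(outcome_correct and process_correct)

print(mixed_reward(True, 0.4))    # correct answer, sloppy reasoning -> partial credit
print(gated_reward(True, False))  # correct answer, wrong process -> no reward
```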

Google’s efficiency perspective approach

While scaling up the number of test-time inference calls is less compute-efficient than scaling training, test-time compute scaling is more advantageous for relatively easy problems where R is low. In the long run, however, model scaling is still necessary to expand the range of problems that can be solved at test time.
If a problem is fundamentally difficult, or is something the model has never encountered, no amount of test-time computation leads to significant improvement (the base model simply lacks the required capability), so additional training (parameter/data scaling) is inherently more effective.
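To make the trade-off concrete, a rough sketch using the common ≈ 2 · params · tokens approximation for forward-pass FLOPs; the model sizes, token counts, and sample counts below are illustrative, not taken from the paper:

```python
# Rough sketch: compare (a) sampling k answers from a small model against
# (b) a single greedy pass from a model m times larger.

def inference_flops(params: float, tokens: int, samples: int = 1) -> float:
    """Approximate forward-pass cost: ~2 * parameters * tokens per sample."""
    return 2 * params * tokens * samples

small, tokens = 7e9, 1_000   # 7B model, ~1k generated tokens (illustrative)
k, m = 16, 14                # 16 samples vs. a 14x larger model

print(f"small model, {k} samples: {inference_flops(small, tokens, k):.2e} FLOPs")
print(f"{m}x larger model, 1 pass: {inference_flops(m * small, tokens):.2e} FLOPs")
# Comparable inference cost; which wins depends on problem difficulty and on R,
# since the larger model also carries a far larger training cost.
```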

Increasing inference-time compute often reduces the success rate of adversarial attacks

“Large language models are not a dead end”

  • Additional prompts such as "Think step by step" are unnecessary and can degrade performance
  • Prefer zero-shot prompting; few-shot examples are acceptable but not required
  • Provide an explicit objective (success criteria) and constraints (see the example prompt below)
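A hedged example of a prompt following these recommendations (the task, schema, and wording are invented for illustration):

```python
# Illustrative prompt for a reasoning model: explicit objective and constraints,
# zero-shot, and no "think step by step" instruction (the model reasons internally).
prompt = (
    "Objective: write a correct SQL query that returns the top 5 customers by total revenue.\n"
    "Success criteria: runs on PostgreSQL 16 against the schema below and returns exactly 5 rows.\n"
    "Constraints: read-only query, no temporary tables, under 30 lines.\n"
    "Schema: customers(id, name), orders(id, customer_id, total)."
)
print(prompt)
```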
Reasoning models outperform instruction-tuned models even without additional inference-time compute
Reasoning models are generally the most robust in terms of safety
 

Recommendations