Test-time Compute (Inference-time Compute)
- Pretraining Scaling is reaching its limits due to finite data.
- Reasoning-model scaling through CoT and AI agents advances reasoning capability.
- Compounding error over long reasoning chains is an increasing threat to ensuring AI alignment.
Test-time compute is important because solving a problem requires compute proportional to the complexity of the underlying algorithm; giving an immediate answer just means reciting memorized information. Human requests also include genuinely new queries, so the model first needs to judge how much thinking a problem demands from its complexity. That judgment can itself be done by pattern matching against familiar problems, which also enables intelligence to extrapolate to harder instances.
Test-time Compute Notion
AI Reasoning Methods

Pre-training, mid-training, and RL post-training each contribute to "reasoning capability" in different ways, which can be causally decomposed in a fully controlled (synthetic) setting. The approach uses DAGs to create reasoning structures, and templates (e.g., animal-zoo, teacher-school) to vary only the surface context. Evaluation is process-verified: the model's output CoT is parsed and the predicted graph is compared against the ground-truth graph.
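As a rough illustration (not the paper's actual harness), a process-verified check could parse the model's CoT into claimed dependency edges and require them to match the ground-truth DAG before crediting the answer; the `X -> Y` step format and the `parse_edges` helper below are assumptions for the sketch.

```python
import re
from typing import Set, Tuple

Edge = Tuple[str, str]

def parse_edges(cot: str) -> Set[Edge]:
    """Extract the dependency edges the model claims in its CoT.
    Assumes steps are written as 'X -> Y' (an illustrative format, not the paper's)."""
    return set(re.findall(r"(\w+)\s*->\s*(\w+)", cot))

def process_verified(cot: str, gold_edges: Set[Edge], answer_ok: bool) -> bool:
    """Credit an answer only if the predicted reasoning graph matches the ground truth."""
    return answer_ok and parse_edges(cot) == gold_edges

# A structurally faithful CoT vs. a shortcut that merely guesses the right answer
gold = {("lion", "zoo"), ("zoo", "city")}
faithful = "lion -> zoo since the lion lives there; zoo -> city since the zoo is downtown. Answer: city"
shortcut = "The answer is probably city."
print(process_verified(faithful, gold, answer_ok=True))  # True
print(process_verified(shortcut, gold, answer_ok=True))  # False: right answer, wrong process
```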
Generalization types
- Extrapolative (Depth): Can the model solve problems when the number of operations (reasoning steps) goes deeper than the training range?
- Contextual (Breadth): Does the same reasoning structure transfer when the surface context changes? (both axes are sketched below)
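A toy sketch of how the two axes might be instantiated on synthetic chain-shaped DAGs: depth is the number of chained operations, breadth is the surface template the same structure is rendered into. Template names and wording are illustrative, not taken from the paper.

```python
# Two surface templates that render the same dependency structure (illustrative wording)
TEMPLATES = {
    "animal-zoo":     "The {a} is kept in the {b}.",
    "teacher-school": "{a} teaches at {b}.",
}

def make_chain_problem(depth: int, template: str):
    """Build a chain-shaped DAG with `depth` edges and render it in one surface context."""
    nodes = [f"v{i}" for i in range(depth + 1)]
    edges = list(zip(nodes[:-1], nodes[1:]))
    prompt = " ".join(TEMPLATES[template].format(a=a, b=b) for a, b in edges)
    return {"edges": edges, "prompt": prompt}

train       = [make_chain_problem(d, "animal-zoo") for d in range(2, 6)]  # depths seen in training
depth_ood   = make_chain_problem(8, "animal-zoo")      # extrapolative: deeper chain than training
context_ood = make_chain_problem(4, "teacher-school")  # contextual: same structure, new surface context
```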
The conditions under which RL produces 'true capability gain' are narrow. For in-distribution (ID) problems already well covered by pre-training, RL increases pass@1 but barely improves pass@128 → this is closer to "sharpening" existing capabilities. Conversely, when pre-training leaves headroom and RL data targets the edge of competence (where pass@1 fails but pass@k shows some success), substantial expansion occurs, with pass@128 improvements even on OOD. RL on problems that are too easy (ID) or too hard (complete OOD-hard) trains poorly. In other words, easy problems only get sharpened, while edge-of-competence problems yield genuine capability gains.
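The pass@1 vs. pass@128 distinction can be made concrete with the standard unbiased pass@k estimator (Chen et al., 2021): sample n completions, observe c correct, and estimate the probability that at least one of k draws succeeds.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# "Sharpening": success probability concentrates, so pass@1 rises while pass@128 is already saturated.
print(pass_at_k(n=128, c=40, k=1), pass_at_k(n=128, c=40, k=128))   # ~0.31, 1.0
# "Capability gain": problems unsolvable at any k start registering successes at large k.
print(pass_at_k(n=128, c=0, k=128), pass_at_k(n=128, c=3, k=128))   # 0.0, 1.0
```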
Contextual generalization requires a 'minimal seed'. For new long-tail context (B), if pre-training exposure is 0%~0.1%, RL also struggles to transfer. However, if context B is included in pre-training even sparsely (e.g., ≥1%) at the atomic primitive level, RL amplifies that seed to create strong transfer (large pass@128 gains). RL struggles to create something from nothing but excels at scaling up small foundations (seeds).
Mid-training is the 'bridge' that significantly affects RL efficiency. When varying the ratio (β) between mid-training and RL with the same compute budget, characteristics diverge: OOD-edge (moderately difficult domain) performance (especially pass@1) is better when mid-training proportion is high and RL is light. OOD-hard (much harder domain) generalization improves when RL proportion is increased. Mid-training lays down priors/representations; RL expands exploration and composition on top of that.
PRMs reduce reward hacking and improve performance. Using only outcome rewards (correctness) easily leads to "shortcuts/dishonest reasoning that just gets the answer right." Mixing in dense rewards based on process verification (an α mix), or only giving the outcome reward when the process is correct, reduces structural errors (dependency mismatches, etc.) and improves pass@1 and some pass@128 on OOD.
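A minimal sketch of the two reward schemes described above, assuming a process verifier that scores structural correctness; the function names and the α value are illustrative.

```python
def mixed_reward(outcome_ok: bool, process_score: float, alpha: float = 0.5) -> float:
    """Dense reward: alpha-weighted mix of process verification and final-answer correctness."""
    return alpha * process_score + (1 - alpha) * float(outcome_ok)

def gated_reward(outcome_ok: bool, process_ok: bool) -> float:
    """Sparse but honest: the outcome reward is paid only when the reasoning process is verified."""
    return float(outcome_ok and process_ok)

# A shortcut trajectory that lands on the right answer with a broken dependency graph
print(mixed_reward(outcome_ok=True, process_score=0.2))  # 0.6 -- partially penalized
print(gated_reward(outcome_ok=True, process_ok=False))   # 0.0 -- no credit for dishonest reasoning
```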
Google’s efficiency perspective

Increasing the number of test-time inferences is generally less compute-efficient than spending the same compute at training time; however, for relatively simple problems and in low-R regimes (R being the ratio of inference to pre-training tokens), test-time compute scaling is more advantageous. The long-term conclusion, though, is that model scaling is still necessary to expand the range of problems that can be solved at test time.
If a problem is fundamentally difficult, or something the model has never encountered before, no amount of test-time computation will lead to significant improvement (the base model itself was not trained for it), making additional training (parameter/data scaling) inherently more effective.
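A back-of-the-envelope sketch of the trade-off above, using the common approximations that pre-training costs about 6·N·D FLOPs and inference about 2·N FLOPs per token; treating R as the ratio of inference to pre-training tokens is an assumption about the notation, and all numbers below are illustrative.

```python
def total_flops(n_params: float, pretrain_tokens: float, inference_tokens: float) -> float:
    """Common approximations: pre-training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per generated token."""
    return 6 * n_params * pretrain_tokens + 2 * n_params * inference_tokens

def compare(small_n, big_n, pretrain_tokens, samples_per_query, tokens_per_sample, n_queries):
    """Small model + heavy test-time sampling vs. a bigger model answering each query once."""
    infer_small = n_queries * samples_per_query * tokens_per_sample
    infer_big = n_queries * tokens_per_sample
    return (total_flops(small_n, pretrain_tokens, infer_small),
            total_flops(big_n, pretrain_tokens, infer_big))

# With few total queries (low R), the small model plus search costs far less overall;
# at very high query volume (high R), the inference term dominates and the comparison flips.
small, big = compare(3e9, 70e9, 1e12, samples_per_query=64, tokens_per_sample=1024, n_queries=1e6)
print(f"small + search: {small:.2e} FLOPs  vs  big model: {big:.2e} FLOPs")
```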
Increasing inference-time compute often reduces the success of attacks
Trading Inference-Time Compute for Adversarial Robustness
Initial evidence that reasoning models such as o1 become more robust to adversarial attacks as they think for longer.
https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/

“Large language models are not a dead end”
- Additional prompting such as “Think step by step” is unnecessary and can degrade performance
- Prefer zero-shot prompting; few-shot examples are acceptable when needed
- Provide an explicit objective (success criteria) and constraints, as in the sketch below
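A hedged example of these practices with the OpenAI Python SDK; the model name, the developer-role message, and the task itself are illustrative assumptions, not prescriptions from the guide. The prompt is zero-shot, omits "think step by step", and states an explicit objective with success criteria and constraints.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Zero-shot request to a reasoning model: explicit objective, success criteria, and constraints only.
resp = client.chat.completions.create(
    model="o1",  # hypothetical choice of reasoning model
    messages=[
        {"role": "developer", "content": (
            "Objective: produce a migration plan from MySQL 5.7 to 8.0 with zero downtime. "
            "Success criteria: every step is reversible and names the exact command to run. "
            "Constraints: no third-party tools; keep the plan under 10 steps."
        )},
        {"role": "user", "content": "Our primary has 2 read replicas and ~500 GB of data."},
    ],
)
print(resp.choices[0].message.content)
```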
OpenAI Platform
https://platform.openai.com/docs/guides/reasoning-best-practices

Reasoning models outperform instruction-tuned models even without additional inference-time compute
Reasoning models are generally the most robust on the safety evaluations
Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests
OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—highlighting progress, challenges, and the value of cross-lab collaboration.
https://openai.com/index/openai-anthropic-safety-evaluation/


Seonglae Cho