Reasoning Model

Creator: Seonglae Cho
Created: 2024 Dec 21 0:30
Edited: 2026 Jan 11 0:04

Test-time Compute, Inference-time Compute

  • Pretraining Scaling is reaching its limits due to finite data.
Test-time compute matters because solving a problem requires computation proportional to the complexity of the underlying algorithm; answering immediately amounts to reciting memorized information. Since some human requests are genuinely novel queries, the model first needs to decide how much thinking a problem warrants based on its complexity. This allocation can itself be done through pattern matching over problems, which also enables extrapolation of intelligence.
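A minimal sketch of this idea, assuming a hypothetical difficulty estimator that maps a query to a reasoning-token budget (`estimate_difficulty` and `thinking_budget` are illustrative names, not from any specific system):

```python
# Hypothetical sketch: allocate test-time compute in proportion to estimated
# problem complexity, instead of spending a fixed budget on every query.

def estimate_difficulty(query: str) -> float:
    """Toy difficulty heuristic (a stand-in for a learned classifier):
    longer, multi-step-looking queries get a higher score in [0, 1]."""
    markers = ["prove", "derive", "step", "optimize", "why"]
    return min(1.0, len(query) / 500 + 0.2 * sum(m in query.lower() for m in markers))

def thinking_budget(query: str, min_tokens: int = 64, max_tokens: int = 4096) -> int:
    """Map estimated difficulty to a chain-of-thought token budget."""
    d = estimate_difficulty(query)
    return int(min_tokens + d * (max_tokens - min_tokens))

print(thinking_budget("What is 2 + 2?"))                            # small budget
print(thinking_budget("Prove the AM-GM inequality and derive the equality condition."))  # larger budget
```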
Test-time Compute Notion
 
 
AI Reasoning Methods
 
 
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
 
 
 
 
 
Pre-training, mid-training, and RL post-training each contribute to "reasoning capability" in different ways, and these contributions can be causally decomposed in a fully controlled (synthetic) setting. The approach uses DAGs to create reasoning structures, with templates (e.g., animal-zoo, teacher-school) that vary only the surface context. Evaluation is process-verified: the model's output CoT is parsed and the predicted graph is compared against the ground-truth graph.
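A minimal sketch of such a process-verified check, assuming the CoT has already been parsed into predicted dependency edges (the parser itself is task-specific and omitted; the function name and the toy DAG below are illustrative):

```python
# Sketch: compare the dependency DAG implied by a model's chain-of-thought
# against the ground-truth reasoning DAG, instead of checking only the final answer.

def edge_metrics(gold_edges: set[tuple[str, str]],
                 pred_edges: set[tuple[str, str]]) -> dict:
    """Precision/recall/F1 over directed edges of the reasoning graph."""
    tp = len(gold_edges & pred_edges)
    precision = tp / len(pred_edges) if pred_edges else 0.0
    recall = tp / len(gold_edges) if gold_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "process_correct": gold_edges == pred_edges}

# Ground-truth DAG for a toy "animal-zoo"-style template: each edge means
# "the left variable must be derived before the right one".
gold = {("num_lions", "num_cats"), ("num_cats", "total_animals")}
pred = {("num_lions", "num_cats"), ("num_lions", "total_animals")}  # parsed from the CoT

print(edge_metrics(gold, pred))
```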
Generalization types
  • Extrapolative (Depth): Can the model solve problems when the number of operations (reasoning steps) goes deeper than the training range?
  • Contextual (Breadth): Does the same reasoning structure transfer when surface context changes?
The conditions under which RL produces a 'true capability gain' are narrow. For in-distribution (ID) problems already well covered by pre-training, RL increases pass@1 but barely improves pass@128 → this is closer to "sharpening" existing capabilities. Conversely, when pre-training leaves headroom and the RL data targets the edge of competence (where pass@1 fails but pass@k shows some success), substantial expansion occurs, with pass@128 improving even on OOD. RL on problems that are too easy (ID) or too hard (complete OOD-hard) does not train well. In other words, easy problems only get sharpened, while problems at the edge of competence yield genuine capability gains.
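For reference, pass@k is usually estimated with the unbiased estimator from the HumanEval paper: with n samples per problem and c correct, pass@k = 1 − C(n−c, k)/C(n, k). A short sketch (the example numbers are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    (drawn without replacement from n generations) is correct, given c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 128 samples with 3 correct: pass@1 is low but pass@128 is already 1.0,
# the kind of gap that RL "sharpening" closes without expanding pass@128.
print(pass_at_k(128, 3, 1), pass_at_k(128, 3, 128))
```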
Contextual generalization requires a 'minimal seed'. For a new long-tail context (B), if pre-training exposure is only 0%–0.1%, RL struggles to transfer. However, if context B is included in pre-training even sparsely (e.g., ≥1%) at the level of atomic primitives, RL amplifies that seed into strong transfer (large pass@128 gains). RL struggles to create something from nothing but excels at scaling up small foundations (seeds).
Mid-training is the 'bridge' that significantly affects RL efficiency. When varying the ratio (β) between mid-training and RL under the same compute budget, the two regimes diverge: performance on OOD-edge (a moderately harder domain), especially pass@1, is better when the mid-training proportion is high and RL is light, whereas generalization to OOD-hard (a much harder domain) improves as the RL proportion is increased. Mid-training lays down priors/representations; RL expands exploration and composition on top of them.
PRM reduces AI Reward Hacking and improves performance. Using only outcome rewards (correctness) easily leads to "shortcuts/dishonest reasoning that just gets the answer right." Mixing in dense rewards based on process verification (an α mix), or granting the outcome reward only when the process is correct, reduces structural errors (dependency mismatches, etc.) and improves pass@1, and some of pass@128, on OOD.
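A minimal sketch of the two reward schemes described above; `alpha`, the verifier scores, and the function names are assumptions for illustration:

```python
# Sketch: combine a sparse outcome reward with a dense process-verification reward.

def mixed_reward(outcome_correct: bool, process_score: float, alpha: float = 0.5) -> float:
    """Blend outcome and process signals: r = (1 - alpha) * outcome + alpha * process.
    process_score in [0, 1] comes from a verifier that checks the parsed reasoning graph."""
    return (1 - alpha) * float(outcome_correct) + alpha * process_score

def gated_reward(outcome_correct: bool, process_correct: bool) -> float:
    """Pay the outcome reward only when the reasoning process itself verifies,
    closing the 'right answer via a shortcut' loophole."""
    return float(outcome_correct and process_correct)

print(mixed_reward(True, 0.4))    # correct answer, sloppy reasoning -> partial credit
print(gated_reward(True, False))  # correct answer, wrong process -> no reward
```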

Google’s efficiency perspective approach

While scaling up the number of test-time inference calls is less compute-efficient than scaling training, test-time compute scaling is more advantageous for relatively easy problems where R is low. In the long run, however, model scaling is still necessary to expand the range of problems that can be solved at test time.
If a problem is fundamentally difficult, or is something the model has never encountered, no amount of test-time computation leads to significant improvement (the base model simply lacks the required capability), so additional training (parameter/data scaling) is inherently more effective.
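To make the trade-off concrete, a rough sketch using the common ≈ 2 · params · tokens approximation for forward-pass FLOPs; the model sizes, token counts, and sample counts below are illustrative, not taken from the paper:

```python
# Rough sketch: compare (a) sampling k answers from a small model against
# (b) a single greedy pass from a model m times larger.

def inference_flops(params: float, tokens: int, samples: int = 1) -> float:
    """Approximate forward-pass cost: ~2 * parameters * tokens per sample."""
    return 2 * params * tokens * samples

small, tokens = 7e9, 1_000   # 7B model, ~1k generated tokens (illustrative)
k, m = 16, 14                # 16 samples vs. a 14x larger model

print(f"small model, {k} samples: {inference_flops(small, tokens, k):.2e} FLOPs")
print(f"{m}x larger model, 1 pass: {inference_flops(m * small, tokens):.2e} FLOPs")
# Comparable inference cost; which wins depends on problem difficulty and on R,
# since the larger model also carries a far larger training cost.
```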

Increasing inference-time compute often reduces the success rate of adversarial attacks

“Large language models are not a dead end”

  • Additional prompts such as "Think step by step" are unnecessary and can degrade performance
  • Prefer zero-shot prompting; few-shot examples are acceptable but not required
  • Provide an explicit objective (success criteria) and constraints (see the example prompt below)
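A hedged example of a prompt following these recommendations (the task, schema, and wording are invented for illustration):

```python
# Illustrative prompt for a reasoning model: explicit objective and constraints,
# zero-shot, and no "think step by step" instruction (the model reasons internally).
prompt = (
    "Objective: write a correct SQL query that returns the top 5 customers by total revenue.\n"
    "Success criteria: runs on PostgreSQL 16 against the schema below and returns exactly 5 rows.\n"
    "Constraints: read-only query, no temporary tables, under 30 lines.\n"
    "Schema: customers(id, name), orders(id, customer_id, total)."
)
print(prompt)
```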
Reasoning models outperform instruction-tuned models even without additional inference-time compute
Reasoning models are generally the most robust in terms of safety
 

Recommendations