Reasoning Model

Creator
Seonglae Cho
Created
2024 Dec 21 0:30
Edited
2025 Apr 27 0:57

Test-time Compute (Inference-time Compute)

  • Pretraining scaling is reaching its limits due to finite data.
Test-time compute matters because solving a problem requires computation proportional to the complexity of the underlying algorithm; producing an immediate answer amounts to reciting memorized information. Since human requests include genuinely novel queries, the model must first estimate, from the problem's complexity, how much thinking is required. This estimate can itself be made through pattern matching over problems, which also enables intelligence extrapolation.
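As a concrete illustration (not from the note itself), the simplest way to spend more test-time compute is to sample several reasoning chains and aggregate them, as in self-consistency. A minimal sketch, assuming a hypothetical `generate(question)` that returns a (chain-of-thought, final-answer) pair:

```python
from collections import Counter

def self_consistency(generate, question, n_samples=8):
    # Each extra sample spends more test-time compute on the same question.
    answers = [generate(question)[1] for _ in range(n_samples)]
    # Majority vote over final answers; harder problems warrant more samples.
    return Counter(answers).most_common(1)[0][0]
```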
Test-time Compute (Notion)
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Google’s efficiency perspective approach

$R = \frac{\text{Inference Tokens}}{\text{Pretraining Tokens}}$
Increasing test-time inference is less compute-efficient than spending the same compute at training time, but for relatively simple problems, where R is low, test-time compute scaling is the more advantageous option. The conclusion, however, is that model scaling remains necessary in the long run to expand the range of problems that can be solved at test time.
If a problem is fundamentally difficult, or one the model has never encountered, no amount of test-time computation will yield significant improvement (the model itself was not trained for it), making additional training (parameter/data scaling) inherently more effective.
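To make R concrete, here is a back-of-the-envelope calculation with invented numbers (none of these figures come from the source):

```python
# Hypothetical figures, purely for illustration.
pretraining_tokens = 1e13      # tokens consumed during pretraining
queries = 1e8                  # expected number of deployment queries
tokens_per_query = 4e3         # reasoning tokens spent per query

# R compares total inference spend to the one-off pretraining spend.
R = (queries * tokens_per_query) / pretraining_tokens
print(f"R = {R:.2f}")  # 0.04 here: a low R favors test-time compute scaling
```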
Hugging Face - new scaling after pretraining

Increasing inference-time compute often reduces the success of attacks

“Large language models are not a dead end”

  • Additional prompts such as “Think step by step” are unnecessary and can degrade performance
  • Prefer zero-shot prompting; few-shot is acceptable
  • Provide an explicit objective (success criteria) and limitations, as in the sketch below
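A minimal sketch of these recommendations, assuming the OpenAI Python SDK and an illustrative reasoning-model name:

```python
from openai import OpenAI

client = OpenAI()

# Zero-shot: no few-shot examples and no "think step by step" nudge.
# The objective (success criteria) and limitations are stated explicitly.
response = client.chat.completions.create(
    model="o1",  # illustrative model name
    messages=[{
        "role": "user",
        "content": (
            "Deduplicate the records in the CSV below.\n"
            "Success criteria: no two rows share an email; row order preserved.\n"
            "Limitations: Python standard library only; memory under 1 GB.\n"
        ),
    }],
)
print(response.choices[0].message.content)
```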
Reasoning Models Don’t Always Say What They Think
Through RL, faithfulness initially increased but soon plateaued. Even in reward-hacking scenarios, the model rarely revealed its hacking strategy in its CoT. This suggests that while CoT monitoring can catch some unintended behaviors, it is not by itself a reliable means of ensuring safety. In other words, even when given answer hints, the model did not disclose that it had used them in its CoT.
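As a rough sketch of how such faithfulness could be measured (hypothetical data shapes; the paper's actual transcript grading is more sophisticated):

```python
def verbalizes_hint(cot: str, hint: str) -> bool:
    # Crude check: did the chain of thought mention the hint at all?
    return hint.lower() in cot.lower()

def faithfulness_rate(transcripts: list[tuple[str, bool]], hint: str) -> float:
    """transcripts: (chain_of_thought, answer_followed_hint) pairs.
    Among cases where the answer followed the hint, the fraction
    whose CoT admits using it."""
    used = [cot for cot, followed in transcripts if followed]
    return sum(verbalizes_hint(cot, hint) for cot in used) / len(used) if used else 0.0
```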
Reasoning models outperform instruction-tuned models even without additional inference-time compute.

Recommendations