Test-time Compute, Inference time compute
- Pretraining Scaling is reaching its limits due to finite data.
- Reasoning Model scaling through CoT and AI Agent evolves reasoning capability.
- Due to Compounding Error, increasing threat to ensuring AI Alignment
Test-time compute is important because when solving problems, AI need complexity proportional to the algorithm itself - giving an immediate answer just means reciting memorized information. However, among human requests there are new queries, and we need to first determine how much thinking is required based on the problem complexity. This can be also done through pattern matching for problems which also enables intelligence extrapolation.
Test-time Compute Notion
Google’s efficiency perspective approach
While increasing the number of test-time inferences becomes less efficient compared to training time, for relatively simple problems where R is low, test-time compute scaling is more advantageous. However, the conclusion is that model scaling is also necessary for expanding the range of problems that can be solved at test time in the long term.
If a problem is fundamentally difficult or something the model has never encountered before, no amount of test-time computation will lead to significant improvements (as the model itself is incorrectly trained), making additional training (parameter/data scaling) inherently more effective
Huggingface - new scaling after pretraining
Increasing inference-time compute often reduces the success of attacks
“Large language models are not a dead end”
- Additional prompt such as “Think step by step” is unnecessary and degrades performance
- Zero-shot prioritized while few-shot is OK
- Provide explicit objective (success criteria) and limitation
Reasoning Models Don’t Always Say What They Think
Through RL, while fidelity initially increased, it soon plateaued. Even in reward hacking scenarios, the model rarely revealed its hacking strategies in CoT. This suggests that while CoT monitoring can catch some unintended behaviors, it alone is not a reliable means of ensuring safety. In other words, even when given answer hints, the model did not disclose using them during the CoT process.
Reasoning Models is better than Instruction Tuned Model Without additional inference