
Test-time RL

Creator
Seonglae Cho
Created
2024 Nov 27 21:20
Editor
Seonglae Cho
Edited
2025 Apr 27 0:35
Refs
Reinforcement Learning
Reasoning Model
Distance Algorithm
Language Model RL
Test-time RL treats reasoning as an RL environment in which reasoning steps are actions, previously generated tokens are observations, and the reward is the correctness of the final solution.
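A minimal sketch of this framing, under stated assumptions: the class `ReasoningEnv`, the `ANSWER:` prefix that marks a final step, the step limit, and the scripted stand-in policy are all hypothetical names chosen for illustration, not taken from any of the models listed here. It only shows how actions, observations, and a sparse correctness reward fit together in one episode.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningEnv:
    """Toy episodic environment: one episode is one attempt at one problem."""
    question: str
    answer: str                      # ground truth, used only to compute reward
    max_steps: int = 8               # assumed cap on reasoning steps
    history: list[str] = field(default_factory=list)

    def reset(self) -> str:
        self.history = []
        return self.observation()

    def observation(self) -> str:
        # Observation = the question plus every previously emitted step.
        return "\n".join([self.question, *self.history])

    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply one reasoning step (the action); return (obs, reward, done)."""
        self.history.append(action)
        final = action.startswith("ANSWER:")   # assumed final-step convention
        done = final or len(self.history) >= self.max_steps
        # Sparse outcome reward: 1.0 only when the final answer is correct.
        reward = 0.0
        if final and action.removeprefix("ANSWER:").strip() == self.answer:
            reward = 1.0
        return self.observation(), reward, done


def scripted_policy(obs: str) -> str:
    """Stand-in for a language model; emits a fixed chain of thought."""
    steps = ["12 * 3 = 36", "36 + 6 = 42", "ANSWER: 42"]
    n_taken = len(obs.split("\n")) - 1  # lines that follow the question
    return steps[n_taken]


def rollout(env: ReasoningEnv, policy) -> tuple[float, list[str]]:
    """Run one episode and return the total reward and the reasoning trace."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total, env.history


if __name__ == "__main__":
    env = ReasoningEnv(question="What is 12 * 3 + 6?", answer="42")
    print(rollout(env, scripted_policy))
```

In a real setup the scripted policy would be replaced by a language model sampling the next step, and the episode reward would feed a policy-gradient update; both are outside the scope of this sketch.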
Test-time RL Models
O1
LLaMa CoT
Gemini Flash
QwQ Alibaba
DeepSeek R1
Kimi k
Hunyuan-T1
https://arxiv.org/pdf/2412.14135
 
 
The Problem with Reasoners | Aidan McLaughlin
Over the next 5 months, the AI industry will pivot entirely from building larger models to building better reasoners. Unfortunately, this project is doomed and will not scale past human-level intelligence in ways you should care about. Let’s talk about why.
https://aidanmclaughlin.notion.site/reasoners-problem
CoT SFT distillation
Sky-T1: Train your own O1 preview model within $450
We introduce Sky-T1-32B-Preview, our reasoning model that performs on par with o1-preview on popular reasoning and coding benchmarks.
https://novasky-ai.github.io/posts/sky-t1/
SFT Memorizes, RL Generalizes (AI Memory, Model Generalization, OOD)
Although the provocative title is not strictly accurate, the study offers useful insight, including for multimodal models.
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
https://tianzhechu.com/SFTvsRL/
 
 

Copyright Seonglae Cho