Test-time RL
Test-time RL treats inference as an RL environment: each reasoning step is an action, the previously generated tokens are the observation, and the reward is the correctness of the final solution.
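To make the mapping concrete, here is a minimal sketch of such an environment. Everything in it is an illustrative assumption rather than part of any particular system: the `ReasoningEnv` class, the `ANSWER:` prefix that marks a final step, the `max_steps` budget, and the exact-match reward are all hypothetical choices.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReasoningEnv:
    """Toy episodic environment (hypothetical): one reasoning step per action."""

    question: str
    answer: str          # reference answer, used only to compute the reward
    max_steps: int = 8   # assumed step budget before the episode is cut off
    context: list[str] = field(default_factory=list)

    def reset(self) -> list[str]:
        # The initial observation is just the question.
        self.context = [self.question]
        return list(self.context)

    def step(self, action: str) -> tuple[list[str], float, bool]:
        # Action: a reasoning step, appended to the running context.
        self.context.append(action)
        prefix = "ANSWER:"  # assumed convention for marking the final step
        is_final = action.startswith(prefix)
        out_of_budget = len(self.context) - 1 >= self.max_steps
        done = is_final or out_of_budget
        # Sparse reward: 1.0 only if the final answer matches the reference.
        reward = 0.0
        if is_final and action[len(prefix):].strip() == self.answer:
            reward = 1.0
        # Observation: everything generated so far, including this step.
        return list(self.context), reward, done


if __name__ == "__main__":
    env = ReasoningEnv(question="What is 2 + 3?", answer="5")
    obs = env.reset()
    obs, reward, done = env.step("Add the two numbers together.")
    obs, reward, done = env.step("ANSWER: 5")
    print(obs, reward, done)
```

The reward here is sparse and arrives only at the final step, which matches the description above; a real setup would likely need a more forgiving answer checker than exact string matching.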
LLM Reasoning Models

The Problem with Reasoners | Aidan McLaughlin
Over the next 5 months, the AI industry will pivot entirely from building larger models to building better reasoners. Unfortunately, this project is doomed and will not scale past human-level intelligence in ways you should care about. Let’s talk about why.
https://aidanmclaughlin.notion.site/reasoners-problem

CoT SFT distillation
Sky-T1: Train your own O1 preview model within $450
We introduce Sky-T1-32B-Preview, our reasoning model that performs on par with o1-preview on popular reasoning and coding benchmarks.
https://novasky-ai.github.io/posts/sky-t1/
While the provocative title is not exactly accurate, the post offers insight that extends even to multimodality.

Seonglae Cho

