LLM Reasoning Model

Creator

Seonglae Cho

Created

2024 Nov 27 21:20

Editor

Seonglae Cho

Edited

2025 Oct 24 1:5

Refs

Reinforcement Learning

Reasoning Model

Distance Algorithm

Language Model RL

Test-time RL

LLM reasoning can be framed as a test-time RL environment: each reasoning step is an action, the previously generated tokens are the observation, and the reward is the correctness of the final solution.
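
A minimal sketch of that framing, assuming a toy setup: the class name `ReasoningEnv`, the `ANSWER:` marker for the final step, the exact-match reward, and the step limit are all hypothetical choices made for illustration, not something taken from the references. It only shows how actions, observations, and reward line up.

```python
from dataclasses import dataclass, field

ANSWER_PREFIX = "ANSWER:"  # hypothetical marker for the final reasoning step


@dataclass
class ReasoningEnv:
    """Toy environment: the policy emits reasoning steps until it answers."""
    question: str
    reference_answer: str
    max_steps: int = 8
    tokens: list[str] = field(default_factory=list)
    steps_taken: int = 0

    def reset(self) -> list[str]:
        # The initial observation is just the tokenized question.
        self.tokens = self.question.split()
        self.steps_taken = 0
        return list(self.tokens)

    def step(self, action: str) -> tuple[list[str], float, bool]:
        # Action: one reasoning step, appended to the running context.
        self.tokens.extend(action.split())
        self.steps_taken += 1
        is_final = action.startswith(ANSWER_PREFIX)
        done = is_final or self.steps_taken >= self.max_steps
        # Reward: sparse, given only for a correct final answer.
        reward = 0.0
        if is_final:
            answer = action[len(ANSWER_PREFIX):].strip()
            reward = 1.0 if answer == self.reference_answer else 0.0
        # Observation: every token generated so far.
        return list(self.tokens), reward, done


env = ReasoningEnv(question="What is 17 + 25 ?", reference_answer="42")
obs = env.reset()
trajectory = ["17 + 25 = 17 + 20 + 5", "37 + 5 = 42", "ANSWER: 42"]
for action in trajectory:
    obs, reward, done = env.step(action)
```

In this sketch the reward is zero for every intermediate step, so credit for a correct answer has to be propagated back over the whole trajectory by whatever RL method is used on top.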

LLM Reasoning Models

OpenAI O series

https://arxiv.org/pdf/2412.14135

The Problem with Reasoners | Aidan McLaughlin

Over the next 5 months, the AI industry will pivot entirely from building larger models to building better reasoners. Unfortunately, this project is doomed and will not scale past human-level intelligence in ways you should care about. Let’s talk about why.

https://aidanmclaughlin.notion.site/reasoners-problem

CoT SFT distillation

Sky-T1: Train your own O1 preview model within $450

We introduce Sky-T1-32B-Preview, our reasoning model that performs on par with o1-preview on popular reasoning and coding benchmarks.

https://novasky-ai.github.io/posts/sky-t1/
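
As a rough illustration of what CoT SFT distillation involves in general, the sketch below turns teacher-generated reasoning traces into supervised fine-tuning examples and keeps only those whose final answer matches a reference. The `TeacherSample` type, the `<think>` delimiters, and the exact-match filter are assumptions for illustration; this is not claimed to be the Sky-T1 pipeline.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    question: str
    reasoning: str   # chain of thought produced by the teacher model
    answer: str      # teacher's final answer
    reference: str   # ground-truth answer used for filtering


def build_sft_examples(samples: list[TeacherSample]) -> list[dict[str, str]]:
    """Keep only traces whose final answer matches the reference."""
    examples = []
    for s in samples:
        if s.answer.strip() != s.reference.strip():
            continue  # discard incorrect traces
        examples.append({
            "prompt": s.question,
            # Hypothetical format: reasoning wrapped in <think> tags.
            "completion": f"<think>\n{s.reasoning}\n</think>\n{s.answer}",
        })
    return examples


samples = [
    TeacherSample("What is 2 + 2?", "2 + 2 is 4.", "4", "4"),
    TeacherSample("What is 3 * 3?", "3 * 3 is 6.", "6", "9"),
]
sft_data = build_sft_examples(samples)
```

The student model would then be fine-tuned on `sft_data` with an ordinary next-token objective, so it learns to reproduce the teacher's reasoning format rather than discovering it through RL.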

SFT Memorizes, RL Generalizes (Model Generalization)

Although the provocative title is not strictly accurate, the study offers useful insight, including for multimodal settings.

SFT Memorizes, RL Generalizes

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

https://tianzhechu.com/SFTvsRL/
