Test-time RL
Test-time RL frames reasoning as an RL environment in which reasoning steps are actions, previous tokens are observations, and the reward is the correctness of the solution.
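The mapping above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not an implementation from any particular system: the class name `ReasoningEnv`, the `ANSWER:` marker for a final step, the step limit, and the exact-match check against a reference answer are all hypothetical choices. The interface loosely follows the familiar reset/step convention.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningEnv:
    """Toy episodic environment: one episode is one attempt at one problem."""

    question: str
    reference_answer: str            # assumption: a known answer to check against
    max_steps: int = 8               # assumption: cap on reasoning steps
    final_prefix: str = "ANSWER:"    # hypothetical marker for the terminal step
    history: list = field(default_factory=list)

    def reset(self):
        """Start a new attempt and return the initial observation."""
        self.history = []
        return self.observation()

    def observation(self):
        # Observation = the question plus every previously generated step.
        return "\n".join([self.question, *self.history])

    def step(self, action):
        """Apply one reasoning step (a string) and return (obs, reward, done)."""
        self.history.append(action)
        answered = action.startswith(self.final_prefix)
        done = answered or len(self.history) >= self.max_steps
        reward = 0.0
        if answered:
            # Reward = correctness of the solution (exact match in this sketch).
            predicted = action[len(self.final_prefix):].strip()
            reward = 1.0 if predicted == self.reference_answer else 0.0
        return self.observation(), reward, done


env = ReasoningEnv(question="What is 17 + 25?", reference_answer="42")
obs = env.reset()
obs, reward, done = env.step("17 + 25 = 17 + 20 + 5")  # intermediate step
obs, reward, done = env.step("ANSWER: 42")             # terminal step
```

In this sketch the reward is sparse: intermediate steps earn nothing, and only the final answer is scored. Exact string matching stands in for whatever correctness check is actually available.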
LLM Reasoning Models

CoT SFT distillation
While the provocative title is not exactly correct, it provides insight even for multimodality.