An RL environment in which reasoning steps are the actions, the previously generated tokens are the observations, and the reward is the correctness of the final solution.
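A minimal sketch of this framing, under stated assumptions: the class name `ReasoningEnv`, whitespace tokenization, the `ANSWER:` prefix that marks a final step, and the exact-match reward are all hypothetical choices made for illustration, not something the notes specify.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningEnv:
    """Toy episodic environment: one action is one reasoning step (a string)."""

    question: str
    answer: str  # ground-truth solution, used only to compute the reward
    max_steps: int = 8  # assumed cap on episode length
    tokens: list = field(default_factory=list)
    steps: int = 0

    def reset(self):
        # The observation is everything generated so far; it starts as the question.
        self.tokens = self.question.split()
        self.steps = 0
        return list(self.tokens)

    def step(self, reasoning_step: str):
        # Action: append one reasoning step to the running token sequence.
        self.tokens.extend(reasoning_step.split())
        self.steps += 1

        # Assumed convention: a step beginning with "ANSWER:" ends the episode.
        is_final = reasoning_step.startswith("ANSWER:")
        done = is_final or self.steps >= self.max_steps

        # Sparse reward: 1.0 only if the final answer matches the solution exactly.
        reward = 0.0
        if is_final:
            predicted = reasoning_step[len("ANSWER:"):].strip()
            reward = 1.0 if predicted == self.answer else 0.0

        return list(self.tokens), reward, done


env = ReasoningEnv(question="What is 2 + 3 ?", answer="5")
obs = env.reset()
obs, reward, done = env.step("2 plus 3 equals 5")  # intermediate step, no reward yet
obs, reward, done = env.step("ANSWER: 5")          # final step, reward is computed here
```

The reward is deliberately sparse: intermediate reasoning steps earn nothing, so credit for a correct solution has to be propagated back to the steps that led to it.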
Test-time RL Models
CoT SFT distillation
While the provocative title is not exactly correct, it offers insight even for multimodality.