o1

Creator
Seonglae Cho
Created
2024 Nov 27 21:20
Edited
2024 Dec 21 14:54
o1 uses an RL environment in which reasoning steps are the actions, the previously generated tokens are the observations, and the reward is the correctness of the final solution.
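
The framing above can be sketched as a toy environment. This is only a minimal illustration under stated assumptions, not o1's actual training setup, which is not described here: the class `ReasoningEnv`, the `ANSWER:` prefix that marks a final step, the exact-match reward, and the step budget are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningEnv:
    """Toy episodic environment (hypothetical, for illustration only).

    action      = one reasoning step, given as a string
    observation = every token (here: every string) produced so far
    reward      = 1.0 if the final answer is correct, else 0.0
    """

    question: str
    answer: str
    max_steps: int = 8  # assumed step budget
    tokens: list[str] = field(default_factory=list)

    def reset(self) -> list[str]:
        # The episode starts with only the question in context.
        self.tokens = [self.question]
        return list(self.tokens)

    def step(self, action: str) -> tuple[list[str], float, bool]:
        # Appending the step makes it part of the next observation.
        self.tokens.append(action)
        is_final = action.startswith("ANSWER:")  # assumed convention
        out_of_budget = len(self.tokens) - 1 >= self.max_steps
        reward = 0.0
        if is_final:
            predicted = action[len("ANSWER:"):].strip()
            # Sparse outcome reward: only correctness of the solution counts.
            reward = 1.0 if predicted == self.answer else 0.0
        return list(self.tokens), reward, is_final or out_of_budget


def run_episode(env: ReasoningEnv, steps: list[str]) -> float:
    """Replay a fixed list of reasoning steps and return the total reward."""
    env.reset()
    total = 0.0
    for action in steps:
        _, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    env = ReasoningEnv(question="What is 2 + 3?", answer="5")
    print(run_episode(env, ["2 + 3 means adding 3 to 2.", "ANSWER: 5"]))
```

In this sketch the intermediate steps earn no reward on their own, so any credit for good reasoning has to flow back from the final correctness signal. A policy (the language model) would replace the fixed list of steps passed to `run_episode`.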
 
 

Implementation

QwQ Alibaba

 
 
