L2T

Creator: Seonglae Cho
Created: 2025 May 22 23:48
Edited: 2025 May 23 8:16

Learning to Think

L2T addresses a limitation of existing RL fine-tuning frameworks for language models, which treat the whole response as a single RL step: it divides the CoT process into reasoning episodes and provides an immediate reward after each one. This Dense Process Reward is defined to jointly account for a fitting information gain and a parameter compression penalty. The approach is notable for explicitly aiming to steadily increase information density within a limited parameter budget, measuring the change contributed by each episode by fitting on it separately. Similar in spirit to the Surprise Score, it builds the Dense Process Reward from the Fisher Information Matrix and PAC bounds; because the full parameter space is too large, it approximates the Fisher term with a quadratic form over a low-rank proxy.
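
In symbols, the per-episode reward structure described above might look like the following (a sketch with assumed notation; r_t, ΔI_fit, λ, and C are placeholders rather than the paper's exact symbols):

```latex
% Hedged sketch of the dense process reward for reasoning episode t:
% reward the information gained by fitting on the episode, penalise the
% parameter cost of storing it (the PAC-bound-motivated compression term).
r_t \;=\; \underbrace{\Delta I_{\mathrm{fit}}(t)}_{\text{fitting information gain}}
      \;-\; \lambda\, \underbrace{C(\Delta\theta_t)}_{\text{parameter compression penalty}}
```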
A large gradient norm alone does not guarantee that the model has learned meaningful information or meaningfully improved the probability of the correct answer. L2T's Fisher-based quadratic form instead measures how strongly a parameter change is reflected in the model's output distribution (and hence its accuracy).
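
For intuition, the quadratic form ½·ΔθᵀFΔθ is the second-order approximation of KL(p_θ ‖ p_{θ+Δθ}), so it scores how much an update actually moves the output distribution rather than how large the update is. Below is a minimal sketch of the low-rank version, assuming an empirical Fisher built from a few per-sample log-likelihood gradients; the function name and shapes are illustrative, not the paper's implementation:

```python
import torch

def fisher_quadratic_form(delta: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    """Approximate 0.5 * delta^T F delta with a low-rank empirical Fisher.

    Taking F ≈ (1/k) * sum_i g_i g_i^T from k per-sample log-likelihood
    gradients g_i, the quadratic form reduces to 0.5 * mean_i (g_i · delta)^2,
    so the d x d Fisher matrix is never materialised.

    delta: flattened parameter update for one episode, shape (d,)
    grads: k flattened per-sample gradients, shape (k, d)
    """
    proj = grads @ delta           # (k,) -- each g_i · delta
    return 0.5 * proj.pow(2).mean()

# Toy usage: d = 10_000 parameters, k = 8 sampled gradients.
d, k = 10_000, 8
delta = torch.randn(d) * 1e-3      # hypothetical per-episode update
grads = torch.randn(k, d)          # stand-ins for per-sample grad log p(y|x)
print(fisher_quadratic_form(delta, grads).item())
```

The rank-k structure is what makes this tractable: cost is O(k·d) per evaluation instead of the O(d²) needed to form the Fisher matrix itself.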