Learning to Think
As one attempt to overcome the limitations of existing language-model RL frameworks that rely on single-step, outcome-only rewards, this approach divides the CoT process into reasoning episodes and provides an immediate reward at each step. The Dense Process Reward is defined to jointly account for a Fitting Information Gain and a Parameter Compression Penalty. It is an innovative, genuinely dense approach that explicitly aims to steadily increase information density within a limited parameter budget, measuring the change by fine-tuning separately on each episode carved out of the CoT. Similar to the Surprise Score, the Dense Process Reward makes impressive use of the Fisher Information Matrix and PAC bounds; because the parameter space is large, the quadratic form is approximated with a low-rank proxy.
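To make the structure of the per-episode reward concrete, here is a minimal sketch: the fitting information gain is approximated as the increase in the log-probability of the correct answer after fine-tuning on a single episode, and the compression penalty as a diagonal-Fisher quadratic form on the resulting parameter update. This is my own illustration under assumptions, not the paper's implementation: the names (`episode_reward`, `info_gain`, `lam`), the HuggingFace-style `.logits` interface, and the use of a single optimizer step as the per-episode update are all illustrative choices.

```python
# Illustrative sketch of a per-episode dense process reward:
# fitting information gain minus a Fisher-weighted compression penalty.
# Assumes a HuggingFace-style causal LM whose forward pass returns .logits.
import torch
import torch.nn.functional as F


def answer_logprob(model, context_ids, answer_ids):
    """Log-probability of the answer tokens given the context."""
    input_ids = torch.cat([context_ids, answer_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0]
    start = context_ids.numel() - 1            # logits at i predict token i+1
    preds = logits[start : start + answer_ids.numel()]
    logp = F.log_softmax(preds, dim=-1)
    return logp.gather(-1, answer_ids.unsqueeze(-1)).sum()


def episode_reward(model, optimizer, context_ids, episode_ids, answer_ids, lam=1e-3):
    """Reward for one CoT episode: information gain minus lam * compression penalty."""
    params = [p for p in model.parameters() if p.requires_grad]
    theta_before = [p.detach().clone() for p in params]

    with torch.no_grad():
        logp_before = answer_logprob(model, context_ids, answer_ids)

    # Fine-tune on this episode only (one step stands in for the per-episode update).
    full = torch.cat([context_ids, episode_ids], dim=-1).unsqueeze(0)
    logits = model(full).logits[0, :-1]
    loss = F.cross_entropy(logits, full[0, 1:])
    optimizer.zero_grad()
    loss.backward()
    # Diagonal empirical Fisher proxy: squared gradients of the episode loss.
    fisher_diag = [
        p.grad.detach() ** 2 if p.grad is not None else torch.zeros_like(p)
        for p in params
    ]
    optimizer.step()

    with torch.no_grad():
        logp_after = answer_logprob(model, context_ids, answer_ids)
        # Fitting information gain: how much more probable the answer became.
        info_gain = (logp_after - logp_before).item()
        # Compression penalty: delta^T diag(F) delta over the parameter update.
        penalty = sum(
            (f * (p.detach() - p0) ** 2).sum()
            for f, p, p0 in zip(fisher_diag, params, theta_before)
        ).item()

    return info_gain - lam * penalty
```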
A large gradient norm alone does not guarantee that the model has learned meaningful information or substantially raised the probability of the correct answer. L2T's Fisher-based quadratic form instead captures how deeply and effectively a parameter change is reflected in the model's output distribution, and hence in its accuracy.
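The point is easy to see on a toy softmax model. The sketch below (my own illustration, not from the paper) compares two parameter updates of identical Euclidean norm: one changes the relative logits and thus the output distribution, the other shifts all logits equally and changes nothing. The Fisher quadratic form separates the two, while the raw update norm cannot; to second order, ½ Δθᵀ F Δθ tracks the KL divergence between the output distributions before and after the update.

```python
# Toy softmax "model": why a Fisher quadratic form beats a raw update norm.
import torch


def dist(theta):
    """Output distribution of a toy softmax model over 5 classes."""
    return torch.softmax(theta, dim=-1)


def fisher(theta):
    """Exact Fisher matrix of a categorical softmax: diag(p) - p p^T."""
    p = dist(theta)
    return torch.diag(p) - torch.outer(p, p)


theta = torch.zeros(5)                 # start from uniform predictions
F_mat = fisher(theta)

# Two parameter updates with identical Euclidean norm (= 1).
delta_a = torch.tensor([1.0, -1.0, 0.0, 0.0, 0.0]) / 2 ** 0.5   # changes relative logits
delta_b = torch.ones(5) / 5 ** 0.5                               # shifts all logits equally

for name, delta in [("a", delta_a), ("b", delta_b)]:
    quad = 0.5 * delta @ F_mat @ delta                 # second-order KL estimate
    p, q = dist(theta), dist(theta + delta)
    kl = torch.sum(p * (p.log() - q.log()))            # true KL(before || after)
    print(f"update {name}: ||delta||={delta.norm():.3f}  "
          f"0.5*d^T F d={quad:.4f}  KL={kl:.4f}")
# update a: ||delta||=1.000  0.5*d^T F d=0.1000  KL=0.0992
# update b: ||delta||=1.000  0.5*d^T F d=0.0000  KL=0.0000
```

Both updates look identical to a norm-based criterion, yet only the first actually moves the model's predictions; the Fisher-weighted form reports exactly that difference.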