Dense Process Reward

Creator
Seonglae Cho
Created
2025 May 22 23:40
Editor
Edited
2025 May 22 23:48
Refs

Information-theoretic dense process reward

A reward scheme that produces consistent, reproducible rewards by approximating the reward computation from internal parameter changes alone, using the Fisher Information Matrix. The reward is the fitting information gain minus the parameter compression penalty.
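The subtraction described here can be written compactly. The symbols below ($r$, $\Delta I_{\text{fit}}$, $q_{\text{post}}$, $p_{\text{prior}}$) are assumed notation for illustration and are not taken from the source:

```latex
r \;=\; \underbrace{\Delta I_{\text{fit}}}_{\text{fitting information gain}}
\;-\; \underbrace{D_{\mathrm{KL}}\!\left(q_{\text{post}}(\theta)\,\|\,p_{\text{prior}}(\theta)\right)}_{\text{parameter compression penalty}}
```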
Here, prior and posterior refer to the parameter distributions before and after an episode; the parameter compression penalty is the KL divergence between them. Computing this KL directly from the parameter distributions is intractable at the dimensionality of an LLM. Instead, a PAC Bound is borrowed to approximate the "posterior-prior KL" efficiently with a theoretically guaranteed quadratic form based on the Fisher Information Matrix. This makes it possible to measure "how much new information was gained without redundancy" consistently, using only internal model signals, while keeping the computational cost reasonable.
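As a rough illustration of the quadratic form, a common second-order approximation of the KL term is ½ Δθᵀ F Δθ, where Δθ is the parameter change over an episode and F is the Fisher Information Matrix. The sketch below assumes a diagonal empirical Fisher estimated from per-sample gradients, which is a simplification chosen here to keep the cost linear in the number of parameters. The function names, the `penalty_weight` parameter, and the scalar `fit_gain` input are all hypothetical; the source does not specify them.

```python
from typing import List, Sequence


def diagonal_fisher(per_sample_grads: Sequence[Sequence[float]]) -> List[float]:
    """Empirical diagonal Fisher: mean of squared per-sample gradients.

    Assumption: each inner sequence is the gradient of the log-likelihood
    for one sample, flattened to a vector.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def quadratic_kl(
    theta_prior: Sequence[float],
    theta_post: Sequence[float],
    fisher_diag: Sequence[float],
) -> float:
    """Quadratic approximation 0.5 * delta^T F delta with a diagonal F."""
    return 0.5 * sum(
        f * (post - prior) ** 2
        for prior, post, f in zip(theta_prior, theta_post, fisher_diag)
    )


def dense_process_reward(
    fit_gain: float,
    theta_prior: Sequence[float],
    theta_post: Sequence[float],
    fisher_diag: Sequence[float],
    penalty_weight: float = 1.0,  # hypothetical knob, not from the source
) -> float:
    """Reward = fitting information gain minus the approximated KL penalty."""
    penalty = quadratic_kl(theta_prior, theta_post, fisher_diag)
    return fit_gain - penalty_weight * penalty


if __name__ == "__main__":
    # Toy two-parameter example with made-up numbers.
    fisher = diagonal_fisher([[1.0, 0.0], [1.0, 2.0]])
    reward = dense_process_reward(2.0, [0.0, 0.0], [1.0, 1.0], fisher)
    print(fisher, reward)
```

The diagonal approximation avoids ever materializing the full Fisher matrix, which would be infeasible at LLM scale; richer structured approximations are possible but are outside what the text describes.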
 
 
 
 
