Information-theoretic dense process reward
A reward system that constructs consistent, reproducible rewards using only internal parameter changes: the reward is the fitting information gain minus a parameter compression penalty, where the penalty is approximated via a quadratic form in the Fisher Information Matrix.
Here, prior and posterior refer to the parameter distributions before and after an episode. Directly computing the KL compression penalty between parameter distributions is intractable at LLM scale. Instead, we borrow a PAC-Bayes bound and approximate the posterior-prior KL with a theoretically guaranteed quadratic form based on the Fisher Information Matrix. This lets us consistently measure how much new, non-redundant information was gained using only internal model signals, while keeping computational cost reasonable.
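A minimal numeric sketch of the quadratic approximation described above, assuming the common diagonal restriction of the Fisher matrix (KL(posterior‖prior) ≈ ½ Δθᵀ F Δθ). The function names, the diagonal restriction, and the β trade-off weight are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher approximation: mean of squared per-sample gradients.
    (Illustrative; a real system would use the model's log-likelihood gradients.)"""
    return np.mean(np.square(per_sample_grads), axis=0)

def kl_quadratic(theta_prior, theta_post, fisher_diag):
    """Quadratic approximation of KL(posterior || prior):
    KL ≈ 0.5 * Δθᵀ F Δθ, with F restricted to its diagonal."""
    delta = theta_post - theta_prior
    return 0.5 * float(np.sum(fisher_diag * delta ** 2))

def dense_reward(info_gain, theta_prior, theta_post, fisher_diag, beta=1.0):
    """Reward = fitting information gain − β · parameter compression penalty."""
    return info_gain - beta * kl_quadratic(theta_prior, theta_post, fisher_diag)

# Toy usage: two parameters, one of which moved during the episode.
theta_before = np.zeros(2)
theta_after = np.array([1.0, 0.0])
fisher = diagonal_fisher(np.array([[1.0, 2.0], [3.0, 4.0]]))  # → [5.0, 10.0]
reward = dense_reward(info_gain=3.0, theta_prior=theta_before,
                      theta_post=theta_after, fisher_diag=fisher)
```

The quadratic form only needs the parameter delta and a Fisher estimate, which is why the reward can be computed from internal model signals alone.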