Information-theoretic dense process reward
A reward system that constructs consistent, reproducible rewards using only internal parameter changes: the reward is the fitting information gain minus a parameter compression penalty, where the penalty is approximated via a quadratic form in the Fisher Information Matrix.
Here, prior and posterior refer to the parameter distributions before and after an episode. Directly computing the KL compression penalty between parameter distributions is intractable at LLM scale. Instead, we borrow a PAC-Bayes bound and approximate the posterior-prior KL with a theoretically guaranteed quadratic form based on the Fisher Information Matrix. This lets us consistently measure how much new, non-redundant information was gained using only internal model signals, while keeping computational cost reasonable.
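A minimal numeric sketch of the quadratic approximation described above, assuming the common diagonal restriction of the Fisher matrix (KL(posterior‖prior) ≈ ½ Δθᵀ F Δθ). The function names, the diagonal restriction, and the β trade-off weight are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher approximation: mean of squared per-sample gradients.
    (Illustrative; a real system would use the model's log-likelihood gradients.)"""
    return np.mean(np.square(per_sample_grads), axis=0)

def kl_quadratic(theta_prior, theta_post, fisher_diag):
    """Quadratic approximation of KL(posterior || prior):
    KL ≈ 0.5 * Δθᵀ F Δθ, with F restricted to its diagonal."""
    delta = theta_post - theta_prior
    return 0.5 * float(np.sum(fisher_diag * delta ** 2))

def dense_reward(info_gain, theta_prior, theta_post, fisher_diag, beta=1.0):
    """Reward = fitting information gain − β · parameter compression penalty."""
    return info_gain - beta * kl_quadratic(theta_prior, theta_post, fisher_diag)

# Toy usage: two parameters, one of which moved during the episode.
theta_before = np.zeros(2)
theta_after = np.array([1.0, 0.0])
fisher = diagonal_fisher(np.array([[1.0, 2.0], [3.0, 4.0]]))  # → [5.0, 10.0]
reward = dense_reward(info_gain=3.0, theta_prior=theta_before,
                      theta_post=theta_after, fisher_diag=fisher)
```

The quadratic form only needs the parameter delta and a Fisher estimate, which is why the reward can be computed from internal model signals alone.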