RLPR

Creator: Seonglae Cho
Created: 2025 Jul 3 23:16
Edited: 2025 Jul 3 23:29

Reinforcement Learning with Reference Probability Reward

RLPR uses the token probabilities computed internally by the LLM while generating answers as rewards, but unlike Token Entropy, it calculates the reward at the sample (prompt) level rather than the token level.
Specifically, the reward is the average generation probability of the reference-answer tokens, which encourages higher confidence in the correct answer. To remove sample-specific bias, it subtracts the probability of decoding the correct answer directly from the prompt alone; without this debiasing, answers that are easy to decode on their own would yield high, exploitable rewards. Stability is further improved by standard-deviation filtering: for each prompt, multiple responses are sampled, and prompts whose rewards (average probability of the correct tokens) have low standard deviation are excluded from training. This filters out samples that are too easy or too hard to produce a stable reward signal.
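A minimal sketch of how such a probability reward could be computed, assuming a HuggingFace-style causal LM with teacher-forced token probabilities. The function names, the empty-reasoning baseline used for debiasing, and the 0.05 std threshold are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of RLPR-style probability rewards (illustrative, not the authors' code).
import torch
import torch.nn.functional as F


def probability_reward(model, prompt_ids, reasoning_ids, answer_ids):
    """Mean generation probability of the reference-answer tokens,
    conditioned on the prompt and the model's sampled reasoning."""
    input_ids = torch.cat([prompt_ids, reasoning_ids, answer_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]                # (seq_len, vocab)
    # Logits at position i predict token i+1, so the predictions for the
    # answer tokens come from the slice starting one step earlier.
    start = prompt_ids.numel() + reasoning_ids.numel()
    probs = F.softmax(logits[start - 1 : start - 1 + answer_ids.numel()], dim=-1)
    token_probs = probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_probs.mean().item()                       # sample-level reward


def debiased_reward(model, prompt_ids, reasoning_ids, answer_ids):
    """Subtract the reward obtained by decoding the answer directly from the
    prompt (empty reasoning), removing the answer-specific bias."""
    r = probability_reward(model, prompt_ids, reasoning_ids, answer_ids)
    baseline = probability_reward(model, prompt_ids, reasoning_ids[:0], answer_ids)
    return r - baseline


def keep_prompt(rewards, min_std=0.05):
    """Std filtering: drop prompts whose sampled rewards barely vary
    (too easy or too hard to give a useful learning signal).
    The 0.05 threshold is an assumed value for illustration."""
    return torch.tensor(rewards).std() >= min_std
```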
This approach improved performance across 7 mathematics and general reasoning benchmarks. By building effective reinforcement learning rewards from the LLM's own probability signals alone, without complex domain-specific verifiers, it greatly improves scalability and efficiency.

VeriFree (sail-sg) · Updated 2025 Aug 21 6:27
