RLPR

Creator: Seonglae Cho
Created: 2025 Jul 3 23:16
Edited: 2025 Jul 3 23:29

Reinforcement Learning with Reference Probability Reward

RLPR uses the token probabilities computed internally by the LLM while generating answers as rewards, but unlike Token Entropy, it calculates the reward at the sample (prompt) level rather than the token level.
Specifically, the reward is the average generation probability of the reference-answer tokens, which encourages higher confidence in the correct answer. To remove sample-specific bias, it subtracts the probability of decoding the correct answer directly from the prompt alone; without this debiasing, answers that are easy to decode on their own would yield high, exploitable rewards. Stability is further improved by standard-deviation filtering: for each prompt, multiple responses are sampled, and prompts whose rewards (average probability of the correct tokens) have low standard deviation are excluded from training. This filters out samples that are too easy or too hard to produce a stable reward signal.
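A minimal sketch of how such a probability reward could be computed, assuming a HuggingFace-style causal LM with teacher-forced token probabilities. The function names, the empty-reasoning baseline used for debiasing, and the 0.05 std threshold are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of RLPR-style probability rewards (illustrative, not the authors' code).
import torch
import torch.nn.functional as F


def probability_reward(model, prompt_ids, reasoning_ids, answer_ids):
    """Mean generation probability of the reference-answer tokens,
    conditioned on the prompt and the model's sampled reasoning."""
    input_ids = torch.cat([prompt_ids, reasoning_ids, answer_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]                # (seq_len, vocab)
    # Logits at position i predict token i+1, so the predictions for the
    # answer tokens come from the slice starting one step earlier.
    start = prompt_ids.numel() + reasoning_ids.numel()
    probs = F.softmax(logits[start - 1 : start - 1 + answer_ids.numel()], dim=-1)
    token_probs = probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_probs.mean().item()                       # sample-level reward


def debiased_reward(model, prompt_ids, reasoning_ids, answer_ids):
    """Subtract the reward obtained by decoding the answer directly from the
    prompt (empty reasoning), removing the answer-specific bias."""
    r = probability_reward(model, prompt_ids, reasoning_ids, answer_ids)
    baseline = probability_reward(model, prompt_ids, reasoning_ids[:0], answer_ids)
    return r - baseline


def keep_prompt(rewards, min_std=0.05):
    """Std filtering: drop prompts whose sampled rewards barely vary
    (too easy or too hard to give a useful learning signal).
    The 0.05 threshold is an assumed value for illustration."""
    return torch.tensor(rewards).std() >= min_std
```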
This approach improved performance across 7 mathematics and general reasoning benchmarks. By building effective reinforcement learning rewards from the LLM's own probability signals alone, without complex domain-specific verifiers, it greatly improves scalability and efficiency.

VeriFree (sail-sg) · Updated 2025 Aug 21 6:27
