for Rewarding LLMs
Policy is language model itself and context is state. State is total state of system and observation could be partial POMDP.
Policy gradient
Reasoning Reward model
RRM
Designed to learn thought patterns autonomously without reasoning examples