Transform scalar rewards into distributions, training the policy to match the entire reward distribution.
Normalize r with a learnable partition function Z to obtain the target distribution
KL minimization ≡ GFlowNet's Trajectory Balance squared loss and expected gradient equivalence (theorem provided).
GFlowNet
A generative policy learning framework that learns a probability distribution proportional to a given reward, enabling diverse generation of good samples.