Transform scalar rewards into distributions, training the policy to match the entire reward distribution.
Normalize r with a learnable partition function Z to obtain the target distribution
KL minimization ≡ GFlowNet's Trajectory Balance squared loss and expected gradient equivalence (theorem provided).
GFlowNet
A generative policy learning framework that learns a probability distribution proportional to a given reward, enabling diverse generation of good samples.
GFlowNet Foundations
Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately...
https://arxiv.org/abs/2111.09266


Seonglae Cho