FlowRL

FlowRL transforms scalar rewards into a target distribution and trains the policy to match that entire distribution rather than only maximizing reward.

Normalize the reward r with a learnable partition function Z to obtain the target distribution.
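As a minimal sketch of this normalization, the snippet below turns rewards for a small, finite set of candidates into a distribution. It assumes the target has the form exp(beta * r) / Z, where the temperature `beta` and the function name `target_distribution` are illustrative and not taken from the notes. Here Z is computed exactly because the candidate set is tiny; in FlowRL, Z is learned instead.

```python
import numpy as np

def target_distribution(rewards, beta=1.0):
    """Turn scalar rewards into a normalized distribution exp(beta * r) / Z.

    Assumption: an exponential (softmax-style) target; beta is a hypothetical
    temperature. Z is summed exactly here, whereas FlowRL learns it.
    """
    scaled = beta * np.asarray(rewards, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    log_z = scaled.max() + np.log(np.exp(scaled - scaled.max()).sum())
    return np.exp(scaled - log_z), log_z

# Hypothetical rewards for three candidate responses to one prompt.
rewards = [1.0, 0.5, 0.0]
probs, log_z = target_distribution(rewards, beta=2.0)
print(probs, log_z)
```

Higher-reward candidates receive more probability mass, but lower-reward ones keep a nonzero share, which is what lets the policy match a distribution instead of collapsing onto a single mode.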

Minimizing the KL divergence between the policy and the target distribution is equivalent, in expected gradient, to minimizing the squared Trajectory Balance loss from GFlowNets (the paper provides a theorem).
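The squared loss can be illustrated with a small sketch. It assumes a residual of the form log Z + log pi(y|x) - beta * r, which is a simplified Trajectory Balance style objective; the paper's full objective may include further terms that are not reproduced here. The names `trajectory_balance_loss` and `beta` are hypothetical.

```python
import numpy as np

def trajectory_balance_loss(log_z, log_prob, reward, beta=1.0):
    """Mean of the squared residual (log Z + log pi(y|x) - beta * r)^2.

    Assumption: simplified form for illustration, not the paper's exact loss.
    """
    log_prob = np.asarray(log_prob, dtype=float)
    reward = np.asarray(reward, dtype=float)
    residual = log_z + log_prob - beta * reward
    return float(np.mean(residual ** 2))

rewards = np.array([1.0, 0.5, 0.0])   # hypothetical rewards
beta = 2.0
true_log_z = np.log(np.exp(beta * rewards).sum())

# A policy that already equals the target exp(beta * r) / Z.
matched_log_prob = beta * rewards - true_log_z
# A uniform policy that ignores the rewards.
uniform_log_prob = np.full(3, np.log(1.0 / 3.0))

loss_matched = trajectory_balance_loss(true_log_z, matched_log_prob, rewards, beta)
loss_uniform = trajectory_balance_loss(true_log_z, uniform_log_prob, rewards, beta)
print(loss_matched, loss_uniform)
```

The residual vanishes exactly when the policy's log-probabilities equal beta * r - log Z, so driving this loss to zero is the same as matching the target distribution.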

Importance sampling with clipping stabilizes off-policy reuse of policy rollouts.
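A minimal sketch of a clipped importance weight follows. It assumes a PPO-style clip range `eps` and a weight that multiplies a per-sample loss; both the clip value and the function names are illustrative choices, not details taken from the notes. In an autograd framework the weight would typically be treated as a constant, which plain NumPy does not need to express.

```python
import numpy as np

def clipped_importance_weight(log_prob_new, log_prob_old, eps=0.2):
    """Ratio pi_new / pi_old, clipped to [1 - eps, 1 + eps].

    Assumption: eps=0.2 is a hypothetical default, chosen for illustration.
    """
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def weighted_loss(per_sample_loss, log_prob_new, log_prob_old, eps=0.2):
    """Average per-sample losses, reweighted by the clipped ratio."""
    weights = clipped_importance_weight(log_prob_new, log_prob_old, eps)
    return float(np.mean(weights * np.asarray(per_sample_loss, dtype=float)))

# Rollouts were sampled from the old policy; the current policy has drifted.
log_prob_old = np.log([0.5, 0.2, 0.1])
log_prob_new = np.log([0.5, 0.4, 0.01])
weights = clipped_importance_weight(log_prob_new, log_prob_old)
print(weights)
```

Clipping bounds how much any single stale rollout can scale the update, which is the stabilizing effect when rollouts from an older policy are reused.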

GFlowNet

A generative policy learning framework that learns a probability distribution proportional to a given reward, enabling diverse generation of good samples.

https://www.arxiv.org/pdf/2509.15207

GFlowNet Foundations

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately...

https://arxiv.org/abs/2111.09266
