FlowRL

Creator: Seonglae Cho
Created: 2025 Oct 9 22:48
Edited: 2025 Oct 9 22:59
Transform scalar rewards into a distribution and train the policy to match the full reward distribution, rather than only maximizing expected reward.
Normalize the reward r with a learnable partition function Z to obtain the target distribution p(y|x) = exp(r(x,y)) / Z(x).
Minimizing this KL divergence is equivalent to GFlowNet's squared Trajectory Balance loss; a theorem in the paper shows the expected gradients match.
Importance sampling + clipping stabilize off-policy reuse of policy rollouts (see the sketch below).
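
A minimal PyTorch sketch of the combined objective, assuming a per-sequence squared trajectory-balance residual with a learnable log-partition term and PPO-style clipped importance weights; the names (`flowrl_tb_loss`, `beta`, `clip_eps`) and the exact clipping scheme are illustrative assumptions, not the paper's code:

```python
import torch

def flowrl_tb_loss(logp_new, logp_old, reward, log_z,
                   beta=1.0, clip_eps=0.2):
    """Squared trajectory-balance residual with clipped importance weights.

    logp_new: log pi_theta(y|x) under the current policy, shape (B,)
    logp_old: log pi_old(y|x) under the rollout policy, shape (B,)
    reward:   scalar rewards r(x, y), shape (B,)
    log_z:    learnable estimate of log Z(x), scalar or shape (B,)
    """
    # TB residual: log Z(x) + log pi_theta(y|x) - beta * r(x, y),
    # which is zero exactly when pi_theta(y|x) = exp(beta * r) / Z(x).
    residual = log_z + logp_new - beta * reward
    # Clipped importance weight for reusing off-policy rollouts;
    # detaching keeps the gradient flowing only through the residual.
    w = torch.exp((logp_new - logp_old).detach())
    w = torch.clamp(w, 1.0 - clip_eps, 1.0 + clip_eps)
    return (w * residual.pow(2)).mean()

# Toy usage: log_z is trained jointly with the policy parameters.
logp_new = torch.randn(4, requires_grad=True)
logp_old = torch.randn(4)
reward = torch.rand(4)
log_z = torch.zeros(1, requires_grad=True)
loss = flowrl_tb_loss(logp_new, logp_old, reward, log_z)
loss.backward()
```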

GFlowNet

A generative policy-learning framework that learns to sample in proportion to a given reward, enabling diverse generation of high-reward samples.
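
A toy numeric illustration of this property (for intuition only, not GFlowNet training code): sampling in proportion to reward keeps every high-reward mode in play, whereas pure reward maximization would collapse onto one of them.

```python
import torch

# Two equally good modes (indices 1 and 2) and two poor ones.
rewards = torch.tensor([1.0, 4.0, 4.0, 1.0])
p = rewards / rewards.sum()  # target distribution: p(x) ∝ R(x)
samples = torch.multinomial(p, 1000, replacement=True)
print(torch.bincount(samples, minlength=4))
# Both high-reward modes appear often; argmax would pick only one.
```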