Off-policy

Refs: On-policy

Off-policy methods update the policy using the data collected so far, even data generated by a different (behavior) policy.

They can improve the policy without generating new samples from the current policy.
On-policy methods depend on the current policy: every policy update requires generating fresh data with that policy, which limits parallelization. Off-policy methods reuse previously collected data, making them better suited for large-scale training.
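
As a minimal sketch of this reuse, the toy example below runs tabular Q-learning over a fixed batch of transitions collected once by a random behavior policy; the environment, sizes, and hyperparameters are illustrative assumptions, not from this note.

```python
import numpy as np

# Toy off-policy setup: tabular Q-learning over a fixed replay buffer.
# All sizes, dynamics, and hyperparameters here are made up for illustration.
n_states, n_actions, gamma, lr = 5, 2, 0.99, 0.1
rng = np.random.default_rng(0)

# Transitions (s, a, r, s') collected once by a random *behavior* policy.
buffer = []
for _ in range(1000):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))   # behavior policy: uniform random
    r = float(s == n_states - 1)       # toy reward: 1 in the last state
    s_next = (s + a + 1) % n_states    # toy deterministic dynamics
    buffer.append((s, a, r, s_next))

Q = np.zeros((n_states, n_actions))
for _ in range(50):                    # many passes over the same fixed data
    for s, a, r, s_next in buffer:
        # Off-policy target: greedy max over next actions, independent of
        # what the behavior policy actually did next.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (td_target - Q[s, a])

greedy_policy = Q.argmax(axis=1)       # improved without new samples
print(greedy_policy)
```

The greedy policy extracted at the end can differ from (and outperform) the random behavior policy, even though no new samples were ever drawn from it.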
The problem is that the Q-function is unreliable on OOD (out-of-distribution) actions, so the learned policy needs to stay close to the unknown behavior policy implied by the data.
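
One common way to enforce this closeness is to add a behavior-cloning penalty to the actor objective, as in TD3+BC. The sketch below is an assumption-laden illustration: the networks, batch, and coefficient `alpha` are all made up for the example, and only the actor update is shown.

```python
import torch
import torch.nn as nn

# Hedged sketch of a behavior-regularized actor update (TD3+BC style).
# Network shapes, alpha, and the data batch are illustrative assumptions.
obs_dim, act_dim, alpha = 8, 2, 2.5

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

# A batch from the offline dataset: states and the behavior policy's actions.
s = torch.randn(32, obs_dim)
a_data = torch.rand(32, act_dim) * 2 - 1

pi = actor(s)
q = critic(torch.cat([s, pi], dim=1))
# Normalized Q term plus a BC penalty keeps pi near in-distribution actions,
# avoiding unreliable Q-values on OOD actions.
lam = alpha / q.abs().mean().detach()
loss = -(lam * q).mean() + ((pi - a_data) ** 2).mean()

opt.zero_grad()
loss.backward()
opt.step()
```

The BC term `((pi - a_data) ** 2).mean()` is what anchors the policy to the dataset; without it, maximizing Q alone would push the actor toward actions the Q-function has never seen.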