On-policy

Creator: Seonglae Cho
Created: 2024 Mar 13 2:45
Edited: 2024 Jun 16 7:59

The policy is updated using only data generated by its latest version

Each time the policy changes, new samples must be generated

On-policy methods learn from data generated by the current policy, so samples collected before an update cannot be reused once the policy changes. Because data collection has to stay synchronized with policy updates, parallelization is limited, and off-policy methods are often better suited to large-scale training.
In on-policy learning, Behavior policy = Target policy
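
As a rough illustration of why every update needs fresh samples, here is a minimal sketch of an on-policy policy-gradient loop on a toy multi-armed bandit. The bandit, the softmax policy, and all names (`collect_batch`, `reinforce_update`, `train_on_policy`) are assumptions made for this example, not part of the original note.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def collect_batch(prefs, arm_means, batch_size, rng):
    """Sample actions from the CURRENT policy and observe noisy rewards."""
    probs = softmax(prefs)
    batch = []
    for _ in range(batch_size):
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], 1.0)
        batch.append((action, reward))
    return batch


def reinforce_update(prefs, batch, lr):
    """One policy-gradient step with a mean-reward baseline."""
    probs = softmax(prefs)
    baseline = sum(r for _, r in batch) / len(batch)
    grads = [0.0] * len(prefs)
    for action, reward in batch:
        advantage = reward - baseline
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            # Gradient of log softmax(prefs)[action] with respect to prefs[a]
            grads[a] += advantage * (indicator - probs[a])
    return [p + lr * g / len(batch) for p, g in zip(prefs, grads)]


def train_on_policy(arm_means, iterations=200, batch_size=32, lr=0.5, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_means)
    for _ in range(iterations):
        # Fresh samples from the latest policy (behavior policy = target policy)
        batch = collect_batch(prefs, arm_means, batch_size, rng)
        prefs = reinforce_update(prefs, batch, lr)
        # The batch is discarded here: it came from the pre-update policy
    return softmax(prefs)
```

Calling `train_on_policy([0.0, 1.0])` should shift most of the probability mass onto the second arm. The point of the sketch is the loop structure: collection and update alternate strictly, and no batch survives past the update it was collected for.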

Behavior policy

The policy that actually selects actions in the environment and collects the resulting reward data

Target policy

The policy being evaluated and improved
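
One common way to see the distinction is to compare the bootstrap targets of SARSA (on-policy) and Q-learning (off-policy). The sketch below is illustrative only; the tabular `q` layout (a list of per-state action-value lists) and the function names are assumptions for this example.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action the behavior policy
    actually chose, so the behavior and target policies coincide."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy action, whatever the
    behavior policy did, so the target policy differs from the behavior policy."""
    return reward + gamma * max(q[next_state])
```

With the same transition, the two targets differ whenever the behavior policy took a non-greedy next action. That gap is exactly the difference between evaluating the policy that generated the data and evaluating a separate target policy.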
 
 
 
 
 
