Update policy on latest data
Each time the policy is changed, we need to generate new samples
On-policy methods have a dependency on the policy, requiring new data generation each time the policy is updated. This dependency limits parallelization, making off-policy methods better suited for large-scale training.
Behavior policy = Target Policy
Behavior policy
Policy who actually select action and then get reward data
Target policy
Target policy to evaluate and improve

Seonglae Cho