On-policy

Creator

Seonglae Cho

Created

2024 Mar 13 2:45

Editor

Seonglae Cho

Edited

2024 Jun 16 7:59

Refs

Off-policy

Update the policy using only data collected by the latest version of that policy

Each time the policy changes, new samples must be generated with the updated policy

On-policy methods tie the training data to the current policy: samples collected under an older policy cannot be reused directly once the policy has been updated, so new data must be generated after every update. This dependency limits parallelization, making off-policy methods better suited for large-scale training.
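The collect-then-update dependency can be sketched on a toy problem. The example below is only an illustration under stated assumptions: a hypothetical Gaussian multi-armed bandit, a softmax policy over logits, and a REINFORCE-style update with a mean-reward baseline. The function names (`collect_batch`, `reinforce_update`, `train_on_policy`) and all hyperparameters are made up for this sketch, not taken from any library.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def collect_batch(logits, arm_means, batch_size, rng):
    """Sample (action, reward) pairs using the CURRENT policy only."""
    probs = softmax(logits)
    batch = []
    for _ in range(batch_size):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Assumed reward model: Gaussian noise around the arm's mean.
        reward = rng.gauss(arm_means[action], 0.1)
        batch.append((action, reward))
    return batch


def reinforce_update(logits, batch, lr):
    """One REINFORCE step with a mean-reward baseline."""
    probs = softmax(logits)
    baseline = sum(r for _, r in batch) / len(batch)
    grads = [0.0] * len(logits)
    for action, reward in batch:
        advantage = reward - baseline
        for i in range(len(logits)):
            # d log pi(action) / d logit_i for a softmax policy
            indicator = 1.0 if i == action else 0.0
            grads[i] += advantage * (indicator - probs[i])
    return [l + lr * g / len(batch) for l, g in zip(logits, grads)]


def train_on_policy(arm_means, iterations=200, batch_size=32, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    for _ in range(iterations):
        # Fresh data must be generated after every policy change.
        batch = collect_batch(logits, arm_means, batch_size, rng)
        logits = reinforce_update(logits, batch, lr)
        # The batch is not kept: it was drawn from the now-outdated policy.
    return softmax(logits)
```

Calling `train_on_policy([0.0, 1.0])` should shift most of the probability mass to the second arm. Note that collection and update strictly alternate inside the loop, which is the serial dependency that makes on-policy training harder to parallelize.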

In on-policy learning: Behavior policy = Target policy

Behavior policy

The policy that actually selects actions in the environment and thereby generates the reward data

Target policy

The policy that is being evaluated and improved
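One common way to see behavior policy = target policy in code is the SARSA update, where the next action used in the learning target is the one the acting policy actually chose. The sketch below assumes a tabular Q-function stored as a list of lists and an epsilon-greedy policy; the helper names and default hyperparameters are illustrative only.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update.

    a_next must come from the same epsilon_greedy policy that acts in the
    environment, so the policy being evaluated is the policy collecting data.
    """
    td_target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (td_target - q[s][a])
    return q
```

By contrast, an off-policy method such as Q-learning would replace `q[s_next][a_next]` with the maximum over actions in `s_next`, evaluating a greedy target policy while a different (exploratory) behavior policy collects the data.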

