On-policy

Creator
Seonglae Cho
Created
2024 Mar 13 2:45
Edited
2024 Jun 16 7:59

Update the policy using only data generated by the latest policy

Each time the policy changes, new samples must be generated with the updated policy

On-policy learning depends on the current policy: data has to be regenerated after every policy update. This limits parallelism, so off-policy methods are better suited to large-scale training.
Behavior policy = Target policy
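
The collect-then-update dependency can be sketched with a small example. This is only an illustrative sketch, not a reference implementation: the two-armed Gaussian bandit, the softmax policy, the gradient-bandit style update, and the names `collect_batch` and `train_on_policy` are all assumptions chosen for illustration. The point it shows is that each update consumes a batch sampled from the current policy, and that batch is thrown away afterwards.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def collect_batch(prefs, arm_means, batch_size, rng):
    """Sample (action, reward) pairs using the current policy only."""
    probs = softmax(prefs)
    batch = []
    for _ in range(batch_size):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], 1.0)  # hypothetical noisy reward
        batch.append((action, reward))
    return batch


def train_on_policy(arm_means, num_updates=200, batch_size=32, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_means)
    batches_collected = 0
    for _ in range(num_updates):
        # Fresh data is required for every update because the policy changed.
        batch = collect_batch(prefs, arm_means, batch_size, rng)
        batches_collected += 1

        probs = softmax(prefs)
        baseline = sum(r for _, r in batch) / len(batch)
        grad = [0.0] * len(prefs)
        for action, reward in batch:
            for a in range(len(prefs)):
                indicator = 1.0 if a == action else 0.0
                grad[a] += (reward - baseline) * (indicator - probs[a])
        prefs = [p + lr * g / len(batch) for p, g in zip(prefs, grad)]
        # The batch is not reused: it was generated by the previous policy.
    return softmax(prefs), batches_collected
```

Calling `train_on_policy([0.0, 1.0])` collects one batch per update, so the number of sampling rounds grows with the number of updates. This is the sequential dependency that makes on-policy training harder to parallelize.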

Behavior policy

The policy that actually selects actions in the environment and collects the resulting reward data

Target policy

The policy that is being evaluated and improved
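
As one concrete case where the two policies coincide, the sketch below uses a tabular SARSA-style update. SARSA is not mentioned in the note and is picked here only as a familiar on-policy example. The dictionary-based Q-table and the names `epsilon_greedy` and `sarsa_update` are hypothetical. The action used in the bootstrap target (`next_action`) is assumed to come from the same epsilon-greedy policy that acts in the environment, so behavior policy and target policy are the same.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """Update Q(state, action) toward reward + gamma * Q(next_state, next_action).

    next_action must be the action the behavior policy actually chose,
    which is what makes the evaluated (target) policy the same policy.
    """
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]
```

In a training loop, `next_action` would be obtained by calling `epsilon_greedy` on `next_state` and then reused as the action executed at the next step.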
 
 
 
 
 
