Off-policy

Creator
Seonglae Cho
Created
2024 Mar 20 2:17
Edited
2024 Jun 18 23:6
Refs
On-policy

Update the policy using the data collected so far.

An off-policy method can improve the policy without generating new samples from that policy.
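On-policy methods depend on the current policy: each time the policy is updated, fresh data has to be generated from the updated policy. This dependency makes them hard to parallelize, so off-policy methods are better suited to large-scale training.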
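A minimal sketch of learning from stored data only, assuming a toy 5-state chain environment and tabular Q-learning (the environment, the constants, and the function names are hypothetical and not taken from this note). Data is collected once by a uniform random behavior policy, and every update afterwards reuses that buffer without any new rollout.

```python
import random

# Hypothetical 5-state chain: action 1 moves right, action 0 moves left.
# Reaching the last state gives reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 0.9, 0.1

def step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def collect(n_transitions, rng):
    """Behavior policy: uniform random actions. Runs once, before any learning."""
    buffer, state = [], 0
    for _ in range(n_transitions):
        action = rng.randrange(N_ACTIONS)
        next_state, reward, done = step(state, action)
        buffer.append((state, action, reward, next_state, done))
        state = 0 if done else next_state
    return buffer

def q_learning_from_buffer(buffer, n_updates, rng):
    """Off-policy updates: only stored transitions are used, no new rollouts."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_updates):
        s, a, r, s2, done = rng.choice(buffer)
        # The target uses the greedy action in s2, not the action the
        # behavior policy actually took next. That is what makes it off-policy.
        target = r if done else r + GAMMA * max(q[s2])
        q[s][a] += ALPHA * (target - q[s][a])
    return q

rng = random.Random(0)
buffer = collect(2000, rng)
q = q_learning_from_buffer(buffer, 20000, rng)
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

Because `collect` never looks at `q`, data collection and learning are decoupled here: several collectors could fill the buffer in parallel while one learner samples from it.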
The problem is that the Q-function is unreliable on out-of-distribution (OOD) actions, meaning actions that rarely or never appear in the data. The learned policy therefore needs to stay close to the unknown behavior policy that generated the data.
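One way to illustrate the OOD issue is a tabular sketch in which the Bellman backup only maximizes over actions that actually appear in the dataset for the next state. This is a support constraint in the spirit of batch-constrained Q-learning; it is an assumption chosen for illustration, not necessarily the method this note has in mind. The dataset, `OOD_GUESS`, and `fit_q` are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9

# Hypothetical offline dataset of (state, action, reward, next_state, done).
# Action 1 is never taken in state 0 or state 1, so it is OOD there.
dataset = [
    (0, 0, 0.0, 1, False),
    (1, 0, 1.0, 2, True),
]

# Pretend the Q-function extrapolates badly: unseen (state, action) pairs
# start at an arbitrary, overly optimistic value (illustrative assumption).
OOD_GUESS = 10.0

def fit_q(dataset, constrain, n_sweeps=100):
    seen = defaultdict(set)  # state -> actions that appear in the data
    for s, a, *_ in dataset:
        seen[s].add(a)
    q = defaultdict(lambda: OOD_GUESS)
    for _ in range(n_sweeps):
        for s, a, r, s2, done in dataset:
            if done:
                target = r
            else:
                # Constrained: back up only through actions the behavior
                # policy took in s2. Naive: back up through every action.
                actions = seen[s2] if constrain else (0, 1)
                target = r + GAMMA * max((q[(s2, b)] for b in actions), default=0.0)
            q[(s, a)] = target
    return q

naive = fit_q(dataset, constrain=False)
constrained = fit_q(dataset, constrain=True)
print(naive[(0, 0)], constrained[(0, 0)])
```

In the naive fit, the optimistic guess for the unseen pair `(1, 1)` is never corrected by data, so it leaks into the value of `(0, 0)`. The constrained fit ignores it and stays consistent with the rewards actually observed.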
 
 
 
 
