Offline RL


No interaction with the environment; also known as Batch RL

Offline RL aims to improve the policy beyond the behavior policy that generated the rollouts. Online data is expensive, so reusing offline data is always good. The challenge is to handle unseen actions in a safe way while still doing better than the data. Offline RL is largely focused on preventing overestimation of OOD actions, and it usually utilizes Off-policy methods due to the fixed, limited dataset.
No env interaction → limited data → off-policy → Q overestimation
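As a minimal sketch of how overestimation on OOD actions can be suppressed, here is a CQL-style conservative penalty added to a standard TD loss (the network class, function names, and hyperparameters below are hypothetical, for illustration only):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Tiny discrete-action Q-network: maps state -> Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def conservative_q_loss(q, q_target, batch, alpha=1.0, gamma=0.99):
    """Standard TD loss plus a CQL-style penalty that pushes Q down on
    all actions (including OOD ones) and back up on dataset actions."""
    s, a, r, s2, done = batch  # minibatch from the fixed offline dataset
    q_sa = q(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for data actions
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_target(s2).max(dim=1).values
    td_loss = nn.functional.mse_loss(q_sa, target)
    cql_penalty = (torch.logsumexp(q(s), dim=1) - q_sa).mean()
    return td_loss + alpha * cql_penalty
```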
Concretely, in implementation this means that trajectory rollouts are no longer generated via env.step at each iteration; since the problem lies in OOD actions, the suitable algorithms differ, but the rest of the implementation is identical.
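In code, the only structural difference is the data source: a minimal sketch of an offline training loop that samples minibatches from a static dataset instead of calling env.step (the dataset layout and update_fn are assumptions for illustration):

```python
import numpy as np

def train_offline(dataset, update_fn, n_steps=100_000, batch_size=256):
    """dataset: dict of aligned numpy arrays ('s', 'a', 'r', 's2', 'done')
    collected in advance by some behavior policy. update_fn: any off-policy
    update, e.g. the conservative loss above. Note: no env.step anywhere."""
    n = len(dataset["r"])
    for _ in range(n_steps):
        idx = np.random.randint(0, n, size=batch_size)  # uniform minibatch
        update_fn({k: v[idx] for k, v in dataset.items()})
```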
The system does not change its approximation of the target function after the initial training phase has been completed.
RL for settings where we don't want to interact with the environment.
The difference from Behavior Cloning is that we don't need expert trajectories for appropriate training.
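For contrast, a sketch of the Behavior Cloning objective, which only regresses onto whatever actions the dataset contains (the policy network here is a hypothetical discrete-action head):

```python
import torch
import torch.nn as nn

def bc_loss(policy: nn.Module, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Supervised imitation: maximize log-likelihood of dataset actions.
    BC can at best match the behavior policy, so it needs expert-quality
    trajectories; offline RL can stitch together better-than-data behavior."""
    logits = policy(s)  # unnormalized action scores
    return nn.functional.cross_entropy(logits, a)
```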
