Offline Learning

Creator: Seonglae Cho
Created: 2023 Sep 10 8:20
Edited: 2024 Nov 21 10:13

No interaction with the environment (also called Batch Learning)

Offline RL aims to improve the policy beyond the behavior policy that was used to collect the rollouts. Online data is expensive, so reusing offline data is always attractive, but we need to handle unseen actions safely while still doing better than the data.
Offline RL is largely focused on preventing overestimation of OOD actions, and it usually relies on Off-policy methods because the dataset is limited and fixed.
No env interaction → limited data → off-policy → Q overestimation
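As an illustration of how overestimation on OOD actions can be penalized, here is a minimal CQL-style sketch for discrete actions in PyTorch. The names q_net, batch, and the penalty weight alpha are assumptions for illustration, not a specific library API.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, batch, alpha=1.0, gamma=0.99):
    """Sketch: standard TD loss plus a conservative penalty that pushes down
    Q-values on actions not present in the dataset (OOD actions)."""
    s, a, r, s_next, done = batch            # tensors from the fixed dataset
    q_all = q_net(s)                         # (B, num_actions) Q-values
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions

    with torch.no_grad():                    # bootstrapped TD target
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_data, target)

    # logsumexp over all actions is dominated by (possibly overestimated) OOD actions,
    # so minimizing (logsumexp - Q on dataset actions) keeps OOD Q-values low.
    cql_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * cql_penalty
```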
In terms of implementation, this means trajectory rollouts are not generated via env.step at each iteration. This is where the OOD issues come from, so the suitable algorithms differ, but the implementation in other respects remains the same (see the training-loop sketch below).
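A minimal sketch of what "no env.step during training" looks like in practice, assuming the dataset is a dict of pre-collected transition tensors and reusing the hypothetical conservative_q_loss sketch above:

```python
import torch

def train_offline(q_net, optimizer, dataset, num_steps=100_000, batch_size=256):
    """Offline training loop: every batch is sampled from a fixed dataset,
    so env.step is never called during training."""
    for _ in range(num_steps):
        idx = torch.randint(len(dataset["reward"]), (batch_size,))
        batch = (dataset["state"][idx], dataset["action"][idx],
                 dataset["reward"][idx], dataset["next_state"][idx],
                 dataset["done"][idx])
        loss = conservative_q_loss(q_net, batch)   # from the sketch above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```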
The system does not change its approximation of the target function after the initial training phase has been completed.
It is RL for settings where we don't want to interact with the environment.
The difference from Behavior Cloning is that we don't need expert trajectories for appropriate training.

Recommendations