No interaction with the environment; also known as Batch Learning.
Offline RL aims to improve the policy beyond the behavior policy that was used to collect the rollouts. Online data is expensive, so reusing offline data is always attractive! The challenge is to handle unseen actions in a safe way while still doing better than the data.
Offline RL is largely about preventing overestimation of out-of-distribution (OOD) actions. It usually builds on off-policy methods, since the dataset is fixed and limited.
No env interaction → limited data → off-policy → Q overestimation
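As one concrete illustration of keeping Q-values in check on OOD actions, here is a minimal sketch of a CQL-style conservative penalty added to a standard TD loss. CQL is just one common approach (not named in these notes), and the discrete-action setup, network shapes, and the penalty weight alpha are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Sketch: standard TD error plus a penalty that pushes down Q-values on
    actions the dataset does not contain (out-of-distribution actions)."""
    s, a, r, s_next, done = batch  # everything sampled from the fixed offline dataset

    # Standard off-policy TD target (no env.step involved anywhere).
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    q_all = q_net(s)                                     # Q(s, ·) for every discrete action
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions (a: long tensor)
    td_loss = F.mse_loss(q_data, target)

    # Conservative term: logsumexp over all actions grows if any (possibly OOD)
    # action gets a large Q-value; subtracting Q on dataset actions avoids
    # penalizing in-distribution values.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return td_loss + alpha * conservative
```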
In terms of implementation, this means that trajectory rollouts are not generated through env.step at each iteration. That is exactly what creates the OOD issue, so the suitable algorithms are different, but the rest of the implementation stays the same. In other words, the system does not change its approximation of the target function after the initial training phase has been completed.
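A rough sketch of what "no env.step per iteration" looks like in code: the only data source is a fixed buffer loaded up front. The OfflineDataset class and agent.update name are placeholders for illustration, not any specific library API.

```python
import numpy as np

class OfflineDataset:
    """Fixed dataset collected once by some behavior policy; it never grows."""
    def __init__(self, transitions):
        # transitions: list of (state, action, reward, next_state, done) tuples
        self.transitions = transitions

    def sample(self, batch_size):
        idx = np.random.randint(len(self.transitions), size=batch_size)
        return [self.transitions[i] for i in idx]

def train_offline(agent, dataset, num_steps=100_000, batch_size=256):
    for _ in range(num_steps):
        batch = dataset.sample(batch_size)
        agent.update(batch)  # e.g. the conservative Q-update sketched above
        # Note: no env.reset() / env.step() here -- unlike online RL,
        # where each iteration would roll out fresh trajectories.
```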
In short, we do RL without interacting with the environment.
The difference from Behavior Cloning is that we don't need expert trajectories to train properly, since the reward signal lets us do better than the behavior that generated the data.
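For contrast, behavior cloning is plain supervised learning on the dataset's actions and ignores rewards, so it can only be as good as the (ideally expert) data. A short sketch, again with placeholder names and a discrete-action assumption:

```python
import torch.nn.functional as F

def bc_loss(policy_net, batch):
    """Behavior cloning: regress the dataset's actions directly, ignoring rewards.
    This only works well when the dataset actions are near-expert."""
    s, a, _r, _s_next, _done = batch
    logits = policy_net(s)
    return F.cross_entropy(logits, a)  # imitate whatever the behavior policy did

# Offline RL instead uses the rewards in (s, a, r, s') to value actions,
# so it can in principle end up better than the data it was trained on.
```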