No interaction with the environment; also called Batch Learning
Offline RL aims to improve the policy beyond the behavior policy that was used to collect the rollouts. Online data is expensive, so reusing offline data is always good! The challenge is to handle unseen actions in a safe way while still doing better than the data.
Offline RL is largely focused on preventing overestimation of OOD (out-of-distribution) actions. It usually builds on off-policy methods because the dataset is fixed and limited.
No env interaction → limited data → off-policy → Q overestimation
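As a concrete illustration of the last step in this chain, below is a minimal sketch of one common countermeasure, a CQL-style conservative penalty that pushes Q-values on out-of-distribution actions down relative to dataset actions. It assumes a discrete action space and PyTorch; the names `q_net`, `target_q_net`, and the coefficient `alpha` are illustrative placeholders, not from this note.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """CQL-style sketch: standard TD error plus a penalty that keeps Q-values
    on out-of-distribution actions below Q-values on dataset actions."""
    s, a, r, s_next, done = batch  # tensors sampled from the fixed offline dataset

    # Standard off-policy TD target (note: no env.step anywhere)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    q_data = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_data, target)

    # Conservative term: logsumexp over all actions upper-bounds the max,
    # so minimizing (logsumexp - Q on dataset actions) pushes OOD Q-values down
    cql_penalty = (torch.logsumexp(q_net(s), dim=1) - q_data).mean()

    return td_loss + alpha * cql_penalty
```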
In terms of implementation, this means trajectory rollouts are not generated through env.step at each iteration. That is exactly what introduces the OOD problem, so the suitable algorithms differ, but the rest of the implementation stays the same, as sketched below.
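A minimal sketch of that training loop, assuming transitions were logged beforehand; `dataset` and `q_update` are placeholders for the fixed transition set and a gradient-step function (e.g. one step on the conservative loss above).

```python
import random

def train_offline(dataset, q_update, num_steps=100_000, batch_size=256):
    """Offline training loop: no env.step; every gradient step samples a minibatch
    from a fixed dataset of (s, a, r, s_next, done) transitions collected earlier
    by some behavior policy."""
    for _ in range(num_steps):
        batch = random.sample(dataset, batch_size)  # reuse the same fixed data
        q_update(batch)  # e.g. a step on conservative_q_loss above

# An online loop would instead interleave acting and learning:
#   a = policy(s); s, r, done, info = env.step(a); buffer.add(...); q_update(buffer.sample())
```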
The system does not change its approximation of the target function after the initial training phase has been completed.
We don't want to interact with the environment during RL training.
The difference from Behavior Cloning is that we don't need expert trajectories for proper training (see the sketch below).
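To make the contrast concrete, here is a rough sketch, again assuming discrete actions and PyTorch: BC is plain supervised learning on the dataset's actions, so its quality is capped by the behavior policy, while a value-based offline update can in principle do better than the data.

```python
import torch.nn.functional as F

def bc_loss(policy_net, batch):
    """Behavior Cloning: supervised imitation of whatever actions appear in the data,
    so performance is capped by the behavior policy (hence the usual need for expert data)."""
    s, a, *_ = batch                 # only states and actions are used
    logits = policy_net(s)           # action logits for discrete actions
    return F.cross_entropy(logits, a)

# Offline RL instead fits values (see conservative_q_loss above) and acts greedily on them,
# which can stitch together behavior better than the data-collecting policy.
```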



Offline learning
In machine learning, systems which employ offline learning do not change their approximation of the target function when the initial training phase has been completed. These systems are also typically examples of eager learning.
https://en.wikipedia.org/wiki/Offline_learning

Seonglae Cho