Batch RL: no interaction with the environment
Offline RL aims to improve the policy beyond the behavior policy that was used to collect the rollouts. Online data is expensive, so reusing offline data is always valuable. The challenge: handle unseen actions safely while still doing better than the data.
Offline RL largely focuses on preventing overestimation of out-of-distribution (OOD) actions. Off-policy methods are typically used because the dataset is fixed and limited.
No env interaction → limited data → off-policy → Q overestimation
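The last step of the chain above can be seen in a toy demo. This is a hedged sketch (the sizes and noise model are illustrative assumptions, not from the source): when Q estimates are noisy, taking a max over all actions turns zero-mean noise into a positive bias, and restricting the max to in-dataset actions (the idea behind constraint-based offline methods) shrinks that bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_actions = 10_000, 10
true_q = 0.0                       # every action is truly worth 0
q_hat = true_q + rng.normal(size=(n_samples, n_actions))  # noisy Q estimates

# Bellman backup takes a max over ALL actions, including OOD ones:
# the max of zero-mean noise is biased upward, even though the true max is 0
naive = q_hat.max(axis=1).mean()

# Restricting the max to actions seen in the dataset shrinks the bias
# (here we pretend only the first 3 actions appear in the data)
constrained = q_hat[:, :3].max(axis=1).mean()

print(naive, constrained)  # both positive, but the constrained max is smaller
```

The fewer OOD actions the max can pick from, the smaller the overestimation, which is why offline methods constrain the policy or penalize OOD Q-values.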
Concretely, this means trajectory rollouts are not generated via env.step at each iteration. The OOD problem makes different algorithms suitable, but the rest of the implementation is identical to the online case.
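A minimal sketch of that point, assuming a hypothetical fixed dataset of (s, a, r, s_next, done) transitions: the training loop samples from the dataset instead of calling env.step, and the tabular Q-learning update itself is unchanged.

```python
import random

# Hypothetical fixed offline dataset; values are made up for illustration
dataset = [
    (0, 0, 0.0, 1, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 2, True),
    (1, 1, 1.0, 2, True),
]

n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma = 0.5, 0.99

random.seed(0)
for _ in range(500):
    # offline: sample a logged transition; never call env.step
    s, a, r, s2, done = random.choice(dataset)
    target = r if done else r + gamma * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])
```

Only the data source changes; the Bellman update is the same one an online off-policy learner would use.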
The system does not change its approximation of the target function after the initial training phase has been completed.
We don't want to interact with the environment during RL.
The difference from Behavior Cloning is that offline RL does not need expert trajectories to train well.
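A toy one-state (bandit) example of that difference, with a made-up dataset from a suboptimal behavior policy: BC imitates the most frequent action and ignores reward, while even the simplest reward-based estimate picks the better but rarely-taken action.

```python
from collections import Counter

# Hypothetical (action, reward) log from a suboptimal behavior policy:
# action 0 (reward 0) is taken 9x more often than action 1 (reward 1)
dataset = [(0, 0.0)] * 9 + [(1, 1.0)]

# Behavior Cloning: copy the majority action, rewards are never used
bc_action = Counter(a for a, _ in dataset).most_common(1)[0][0]

# Offline RL (one-step case): pick the action with the highest mean reward
returns = {}
for a, r in dataset:
    returns.setdefault(a, []).append(r)
rl_action = max(returns, key=lambda a: sum(returns[a]) / len(returns[a]))

print(bc_action, rl_action)  # BC copies the data; RL exceeds it
```

Because it uses reward, offline RL can improve on non-expert data, which is exactly why it doesn't require expert trajectories.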