No interaction with the environment; also known as Batch Learning.
Offline RL aims to improve the policy beyond the behavior policy that was used to collect the rollouts. Online data is expensive, so reusing offline data is always attractive! The challenge is to handle unseen actions in a safe way while still doing better than the data.
Offline RL is largely about preventing overestimation of out-of-distribution (OOD) actions. It usually builds on off-policy methods, since the dataset is fixed and limited.
No env interaction → limited data → off-policy → Q overestimation
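As one concrete illustration of keeping Q-values in check on OOD actions, here is a minimal sketch of a CQL-style conservative penalty added to a standard TD loss. CQL is just one common approach (not named in these notes), and the discrete-action setup, network shapes, and the penalty weight alpha are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Sketch: standard TD error plus a penalty that pushes down Q-values on
    actions the dataset does not contain (out-of-distribution actions)."""
    s, a, r, s_next, done = batch  # everything sampled from the fixed offline dataset

    # Standard off-policy TD target (no env.step involved anywhere).
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    q_all = q_net(s)                                     # Q(s, ·) for every discrete action
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions (a: long tensor)
    td_loss = F.mse_loss(q_data, target)

    # Conservative term: logsumexp over all actions grows if any (possibly OOD)
    # action gets a large Q-value; subtracting Q on dataset actions avoids
    # penalizing in-distribution values.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return td_loss + alpha * conservative
```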
In terms of implementation, this means that trajectory rollouts are not generated through env.step at each iteration. That is exactly what creates the OOD issue, so the suitable algorithms are different, but the rest of the implementation stays the same. In other words, the system does not change its approximation of the target function after the initial training phase has been completed.
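A rough sketch of what "no env.step per iteration" looks like in code: the only data source is a fixed buffer loaded up front. The OfflineDataset class and agent.update name are placeholders for illustration, not any specific library API.

```python
import numpy as np

class OfflineDataset:
    """Fixed dataset collected once by some behavior policy; it never grows."""
    def __init__(self, transitions):
        # transitions: list of (state, action, reward, next_state, done) tuples
        self.transitions = transitions

    def sample(self, batch_size):
        idx = np.random.randint(len(self.transitions), size=batch_size)
        return [self.transitions[i] for i in idx]

def train_offline(agent, dataset, num_steps=100_000, batch_size=256):
    for _ in range(num_steps):
        batch = dataset.sample(batch_size)
        agent.update(batch)  # e.g. the conservative Q-update sketched above
        # Note: no env.reset() / env.step() here -- unlike online RL,
        # where each iteration would roll out fresh trajectories.
```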
In short, we do RL without interacting with the environment.
The difference from Behavior Cloning is that we don't need expert trajectories to train properly, since the reward signal lets us do better than the behavior that generated the data.
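For contrast, behavior cloning is plain supervised learning on the dataset's actions and ignores rewards, so it can only be as good as the (ideally expert) data. A short sketch, again with placeholder names and a discrete-action assumption:

```python
import torch.nn.functional as F

def bc_loss(policy_net, batch):
    """Behavior cloning: regress the dataset's actions directly, ignoring rewards.
    This only works well when the dataset actions are near-expert."""
    s, a, _r, _s_next, _done = batch
    logits = policy_net(s)
    return F.cross_entropy(logits, a)  # imitate whatever the behavior policy did

# Offline RL instead uses the rewards in (s, a, r, s') to value actions,
# so it can in principle end up better than the data it was trained on.
```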