No interaction with the environment; also called Batch Learning
Offline RL aims to improve the policy beyond the behavior policy that was used to collect the rollouts. Online data is expensive, so reusing offline data is always good! The challenge is to handle unseen actions in a safe way while still doing better than the data.
Offline RL is largely focused on preventing overestimation of OOD (out-of-distribution) actions. It usually builds on off-policy methods because the dataset is fixed and limited.
No env interaction → limited data → off-policy → Q overestimation
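As a concrete illustration of the last step in this chain, below is a minimal sketch of one common countermeasure, a CQL-style conservative penalty that pushes Q-values on out-of-distribution actions down relative to dataset actions. It assumes a discrete action space and PyTorch; the names `q_net`, `target_q_net`, and the coefficient `alpha` are illustrative placeholders, not from this note.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """CQL-style sketch: standard TD error plus a penalty that keeps Q-values
    on out-of-distribution actions below Q-values on dataset actions."""
    s, a, r, s_next, done = batch  # tensors sampled from the fixed offline dataset

    # Standard off-policy TD target (note: no env.step anywhere)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    q_data = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_data, target)

    # Conservative term: logsumexp over all actions upper-bounds the max,
    # so minimizing (logsumexp - Q on dataset actions) pushes OOD Q-values down
    cql_penalty = (torch.logsumexp(q_net(s), dim=1) - q_data).mean()

    return td_loss + alpha * cql_penalty
```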
In terms of implementation, this means trajectory rollouts are not generated through env.step at each iteration. That is exactly what introduces the OOD problem, so the suitable algorithms differ, but the rest of the implementation stays the same, as sketched below.
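A minimal sketch of that training loop, assuming transitions were logged beforehand; `dataset` and `q_update` are placeholders for the fixed transition set and a gradient-step function (e.g. one step on the conservative loss above).

```python
import random

def train_offline(dataset, q_update, num_steps=100_000, batch_size=256):
    """Offline training loop: no env.step; every gradient step samples a minibatch
    from a fixed dataset of (s, a, r, s_next, done) transitions collected earlier
    by some behavior policy."""
    for _ in range(num_steps):
        batch = random.sample(dataset, batch_size)  # reuse the same fixed data
        q_update(batch)  # e.g. a step on conservative_q_loss above

# An online loop would instead interleave acting and learning:
#   a = policy(s); s, r, done, info = env.step(a); buffer.add(...); q_update(buffer.sample())
```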
The system does not change its approximation of the target function after the initial training phase has been completed.
We don't want to interact with the environment during RL training.
The difference from Behavior Cloning is that we don't need expert trajectories for proper training (see the sketch below).
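To make the contrast concrete, here is a rough sketch, again assuming discrete actions and PyTorch: BC is plain supervised learning on the dataset's actions, so its quality is capped by the behavior policy, while a value-based offline update can in principle do better than the data.

```python
import torch.nn.functional as F

def bc_loss(policy_net, batch):
    """Behavior Cloning: supervised imitation of whatever actions appear in the data,
    so performance is capped by the behavior policy (hence the usual need for expert data)."""
    s, a, *_ = batch                 # only states and actions are used
    logits = policy_net(s)           # action logits for discrete actions
    return F.cross_entropy(logits, a)

# Offline RL instead fits values (see conservative_q_loss above) and acts greedily on them,
# which can stitch together behavior better than the data-collecting policy.
```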



Offline learning
In machine learning, systems which employ offline learning do not change their approximation of the target function when the initial training phase has been completed. These systems are also typically examples of eager learning.
https://en.wikipedia.org/wiki/Offline_learning

Seonglae Cho