State transition probability, dynamics, world model (MBRL)
A learned model is useful for sample efficiency and is somewhat task-agnostic. On the other hand, models do not optimize task performance directly and are sometimes harder to learn than a policy. Note that "model" here does not mean the policy; it refers to a model that predicts the behavior of the environment (its dynamics).
Although MBRL is in theory off-policy (meaning it can learn from any data), in practice, it will perform poorly if you do not have on-policy data. In other words, if a model is trained on only randomly-collected data, it will (in most cases) be insufficient to describe the parts of the state space that we may actually care about. We can therefore use on-policy data collection in an iterative algorithm to improve overall task performance.
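Below is a minimal sketch of that iterative loop. The helper callables (`collect_rollout`, `train_model`, `plan_with_model`, `random_policy`) are hypothetical placeholders you would supply for your own environment, not any particular library's API.

```python
from typing import Callable, List, Tuple

Transition = Tuple[object, object, float, object]  # (state, action, reward, next_state)

def iterative_mbrl(
    collect_rollout: Callable[[Callable], List[Transition]],  # runs a policy in the real env
    train_model: Callable[[List[Transition]], object],        # fits p(s' | s, a) to a dataset
    plan_with_model: Callable[[object, object], object],      # picks an action using the model
    random_policy: Callable[[object], object],
    num_iterations: int = 10,
) -> Tuple[object, List[Transition]]:
    """Alternate between fitting a dynamics model and collecting on-policy data.

    The initial dataset is random, so the model is only accurate near random
    behavior; each planning rollout then adds data from the states the planner
    actually visits, shrinking the distribution mismatch.
    """
    dataset: List[Transition] = collect_rollout(random_policy)  # bootstrap with random data
    model = None
    for _ in range(num_iterations):
        model = train_model(dataset)                            # refit on everything seen so far
        policy = lambda state, m=model: plan_with_model(m, state)
        dataset += collect_rollout(policy)                      # append on-policy transitions
    return model, dataset
```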
Unlike Q-learning, which accounts for long-term returns, model learning only considers single-step rewards and transitions.
When learning the model, we use an MSE loss for deterministic dynamics and a log-likelihood loss for stochastic dynamics.
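For concreteness, here is a hedged illustration of both losses, using a Gaussian output head for the stochastic case (a common but not universal choice); the function and argument names are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def deterministic_loss(pred_next_state, true_next_state):
    # Deterministic model: predict s' directly and minimize mean squared error.
    return F.mse_loss(pred_next_state, true_next_state)

def gaussian_nll_loss(pred_mean, pred_log_var, true_next_state):
    # Stochastic model: predict a Gaussian over s' and minimize the negative
    # log-likelihood (equivalently, maximize log p(s' | s, a)), up to a constant.
    inv_var = torch.exp(-pred_log_var)
    return 0.5 * ((true_next_state - pred_mean) ** 2 * inv_var + pred_log_var).mean()
```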
The world model serves not just to simulate the environment; it also functions as a tool for determining optimal actions:
- Learn an approximate model from experience
- Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
- Count outcomes s' for each s, a
- Normalize to give an estimate of $\hat{T}(s, a, s')$
- This is simply the empirical probability of landing in s' after taking action a in state s
Step 2: Solve the learned MDP
- Use value iteration, as before (a sketch of both steps follows below)
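Here is a small, self-contained sketch of both steps for a tabular MDP: count transitions, normalize to get $\hat{T}(s, a, s')$ and an average reward estimate, then run value iteration on the estimated model. The discount factor, tolerance, and the (state, action, reward, next_state) sample format are assumptions made for this example.

```python
import numpy as np

def estimate_mdp(transitions, num_states, num_actions):
    """Step 1: build T_hat(s, a, s') and R_hat(s, a) from (s, a, r, s') samples."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sums = np.zeros((num_states, num_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
    visits = counts.sum(axis=2, keepdims=True)
    # Normalize counts into transition probabilities; unvisited (s, a) stay zero.
    T_hat = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
    R_hat = np.divide(reward_sums, visits[:, :, 0],
                      out=np.zeros_like(reward_sums), where=visits[:, :, 0] > 0)
    return T_hat, R_hat

def value_iteration(T_hat, R_hat, gamma=0.95, tol=1e-6):
    """Step 2: solve the learned MDP as if the estimates were correct."""
    num_states, num_actions, _ = T_hat.shape
    V = np.zeros(num_states)
    while True:
        Q = R_hat + gamma * (T_hat @ V)        # Q(s, a) under the estimated model
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # state values and greedy policy
        V = V_new
```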
Different Ways to Interact with the Environment
Method performance can be measured by expert-demonstration efficiency or by the amount of real-world interaction required.
- online: direct interaction with the environment
- offline: learning only from previously collected data
- model-based: interacting with a learned (virtual) environment
Model-Based Learning
If we know the world model, how can we use it?
We should pay attention to data-distribution mismatch, which causes inaccurate model predictions and therefore bad plans. There are ways to mitigate this, such as re-visiting problematic states (e.g., the cliff example) in the real environment so that the model gets corrected where it matters.
- Model-based planning (a random-shooting sketch follows below)
- Generating data with the learned model
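One common way to plan with a learned model is random shooting, a simple form of model-predictive control: sample candidate action sequences, roll each one forward through the model, and execute only the first action of the best sequence before replanning. The `model_step` and `reward_fn` callables below are hypothetical placeholders, as are the action bounds and hyperparameters.

```python
import numpy as np

def random_shooting_plan(model_step, reward_fn, state, action_dim,
                         horizon=10, num_candidates=100, rng=None):
    """Pick an action by simulating random action sequences through the learned model.

    model_step(state, action) -> predicted next state (the learned dynamics)
    reward_fn(state, action)  -> scalar reward estimate
    """
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(num_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))  # candidate sequence
        s, total = state, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = model_step(s, a)             # roll forward in the virtual environment
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action                 # MPC: execute the first action, then replan
```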
Reinforcement Learning Implementations
State value function V(s) - the expected (discounted) sum of rewards obtained starting from state s, over all future time steps
Action value function Q(s, a) - the expected (discounted) sum of rewards obtained after taking action a in state s
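In symbols (the standard definitions, assuming a policy $\pi$ and discount factor $\gamma$, which the notes above leave implicit):

```math
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s,\ a_{0}=a\right]
```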
Why might we want our network to predict state differences instead of directly predicting the next state?
This is particularly advantageous when consecutive states differ only slightly: the regression targets stay small and well-scaled, which improves training stability, and the next state is recovered as s' = s + Δs.
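A minimal sketch of the idea (the two-layer MLP and its sizes are hypothetical choices; the only essential part is returning `state + delta`):

```python
import torch
import torch.nn as nn

class DeltaDynamicsModel(nn.Module):
    """Predict the change in state, Δs, rather than the next state itself."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),  # output has the shape of a state difference
        )

    def forward(self, state, action):
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta                   # next state = current state + predicted Δs
```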
Any AI agent capable of multi-step, goal-directed tasks must possess an accurate internal world model; this is shown by a constructive proof that such a model can be extracted from the agent's policy with bounded error. The more an agent has learned (the more experience it has), the better it becomes at solving "deeper" goals, and the more accurately we can reconstruct the transition probabilities just by observing its policy. There is no "model-free" shortcut: the ability to achieve long-term, complex goals inherently requires learning an accurate world model. However, it is not necessary to explicitly define and train the world model; defining a good next-token-prediction objective and appropriate implicit incentives for the agent is sufficient.