State transition probability, dynamics, world model (MBRL)
A model is useful for sample efficiency and is somewhat task-agnostic. However, models do not optimize for task performance directly and are sometimes harder to learn than a policy. Here, "model" does not mean the policy; it refers to whether we use a model that predicts the behavior of the environment.
Although MBRL is in theory off-policy (meaning it can learn from any data), in practice it performs poorly without on-policy data. In other words, a model trained only on randomly collected data will, in most cases, be insufficient to describe the parts of the state space we actually care about. We can therefore use on-policy data collection in an iterative algorithm to improve overall task performance, as sketched below.
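A minimal sketch of that iterative loop, assuming a gym-style environment (`env.reset()` / `env.step()`) and hypothetical `model.fit` / `planner.make_policy` interfaces (not from the source):

```python
def collect_rollout(env, policy, steps):
    """Roll out `policy` in the real environment, returning (s, a, r, s') tuples."""
    data, obs = [], env.reset()
    for _ in range(steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        data.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs
    return data

def iterative_mbrl(env, model, planner, random_policy, iterations=10, steps=1000):
    # Bootstrap with random exploration, then alternate model fitting and
    # on-policy collection so the model covers the states the planner visits.
    dataset = collect_rollout(env, random_policy, steps)
    policy = random_policy
    for _ in range(iterations):
        model.fit(dataset)                              # fit dynamics model on all data so far
        policy = planner.make_policy(model)             # plan / derive a policy from the model
        dataset += collect_rollout(env, policy, steps)  # add on-policy data
    return model, policy
```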
The former, unlike Q, considers only the single-step reward.
When fitting the model, a deterministic model is trained with MSE and a stochastic model with log probability (likelihood), as sketched below; but the world model should not merely imitate the environment, it should function as a tool that helps choose optimal actions.
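A minimal PyTorch sketch of the two losses (the Gaussian parameterization of the stochastic model is an illustrative choice, not implied by the note):

```python
import torch
import torch.nn as nn

# Deterministic dynamics model: predict the next state, train with MSE.
class DeterministicModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def loss(self, s, a, s_next):
        pred = self.net(torch.cat([s, a], dim=-1))
        return ((pred - s_next) ** 2).mean()            # MSE

# Stochastic dynamics model: predict a Gaussian over next states,
# train by maximizing log probability (minimizing NLL).
class StochasticModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim))            # mean and log-std

    def loss(self, s, a, s_next):
        mean, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        return -dist.log_prob(s_next).sum(-1).mean()     # negative log-likelihood
```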
- Learn an approximate model based on experiences
- Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
- Count outcomes s’ for each s, a
- Normalize to give an estimate of $\hat{T}(s, a, s')$
- i.e., the simple empirical probability average for each action
Step 2: Solve the learned MDP
- Use value iteration, as before
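A small Python sketch of both steps, assuming `transitions` is a list of observed `(s, a, r, s_next)` tuples:

```python
from collections import defaultdict

# Step 1: estimate T_hat(s, a, s') and R_hat(s, a) from counts of observed transitions.
def learn_empirical_mdp(transitions):
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    T_hat = {sa: {s2: c / visits[sa] for s2, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return T_hat, R_hat

# Step 2: solve the learned MDP with value iteration, as if it were correct.
def value_iteration(T_hat, R_hat, states, actions, gamma=0.9, iters=100):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R_hat.get((s, a), 0.0)
                    + gamma * sum(p * V[s2]
                                  for s2, p in T_hat.get((s, a), {}).items())
                    for a in actions)
             for s in states}
    return V
```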
Ways of interacting with the environment
A method's performance is measured by how efficient it is with expert demonstrations, or by how much real-world interaction it requires.
- online: interact directly with the environment
- offline: use only the data already collected
- model-based: use a learned virtual environment
Model-based learning
If we know the world model, how can we use it?
We should account for data distribution mismatch, which leads to inaccurate model predictions and therefore poor planning. There are ways to prevent this, such as re-visiting problematic regions (e.g., near the cliff) in the real environment to collect corrective data, as in the iterative loop sketched above.
- Model-based planning (see the sketch after this list)
- Generating data
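To make the planning use concrete, here is a minimal random-shooting (MPC-style) planner; the `model(s, a) -> (next_state, reward)` interface, action bounds, and horizon are illustrative assumptions:

```python
import numpy as np

def random_shooting_plan(model, state, action_dim, horizon=10, num_sequences=500):
    """Sample random action sequences, roll them out in the learned model,
    and return the first action of the best-scoring sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(num_sequences):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s, r = model(s, a)      # roll the sequence forward in the learned model
            total += r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    # In MPC style, execute only the first action and replan at the next step.
    return best_first_action
```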
Model-based learning notions
State value function - the expected sum of rewards obtained from the states passed through, over all future time steps.
Action value function - the expected return from taking a given action in this state, at that time step.
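In symbols, for a policy $\pi$ and discount factor $\gamma$:

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a\right]
$$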
Why we might want our network to predict state differences, instead of directly predicting the next state: this is especially advantageous when consecutive states differ only slightly, and because the network learns on the difference term, training stability improves.
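A minimal PyTorch sketch of the difference-prediction idea, i.e. $s_{t+1} = s_t + f_\theta(s_t, a_t)$ (the network shape is illustrative):

```python
import torch
import torch.nn as nn

class DeltaDynamicsModel(nn.Module):
    """Predicts the state *difference* (delta) rather than the absolute next state."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        delta = self.net(torch.cat([s, a], dim=-1))
        return s + delta                                  # s_{t+1} = s_t + f_theta(s_t, a_t)

    def loss(self, s, a, s_next):
        # Regress the (typically small, well-scaled) difference s_next - s.
        pred_delta = self.net(torch.cat([s, a], dim=-1))
        return ((pred_delta - (s_next - s)) ** 2).mean()
```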