Model based RL

Created: 2019 Nov 5 5:18
Edited: 2025 Jun 13 19:09

State transition probability, dynamics, world model (MBRL)

A model is useful for sample efficiency and is somewhat task-agnostic. However, models do not optimize for task performance directly and are sometimes harder to learn than a policy. Here, "model" does not refer to the policy; it refers to a learned model that predicts the behavior of the environment.
Although MBRL is in theory off-policy (meaning it can learn from any data), in practice it performs poorly without on-policy data. In other words, if a model is trained only on randomly collected data, it will (in most cases) be insufficient to describe the parts of the state space we actually care about. We can therefore use on-policy data collection in an iterative algorithm to improve overall task performance.
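A rough sketch of that iterative loop, assuming a Gym-style `env` and user-supplied `fit_model` / `plan_action` functions (none of these names come from the note):

```python
def mbrl_iteration_loop(env, fit_model, plan_action, n_iters=10, horizon=200):
    """env: object with reset()/step(a) like the classic Gym API (assumed).
    fit_model: callable(dataset) -> model, fits dynamics on (s, a, s') tuples.
    plan_action: callable(model, s) -> a, e.g. MPC using the learned model."""
    dataset = []

    # Bootstrap the dataset with random exploration.
    s = env.reset()
    for _ in range(horizon):
        a = env.action_space.sample()
        s_next, r, done, info = env.step(a)
        dataset.append((s, a, s_next))
        s = env.reset() if done else s_next

    for _ in range(n_iters):
        model = fit_model(dataset)              # 1. fit dynamics to all data so far
        s = env.reset()
        for _ in range(horizon):                # 2. collect on-policy data by planning
            a = plan_action(model, s)
            s_next, r, done, info = env.step(a)
            dataset.append((s, a, s_next))      # 3. aggregate, so the model covers
            s = env.reset() if done else s_next #    the states the planner visits
    return model, dataset
```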
Unlike Q-learning, the model-fitting objective only considers single-step rewards and transitions (it does not bootstrap multi-step returns).
When fitting the model, we use an MSE loss for deterministic models and log-probability (likelihood) for stochastic models.
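A small illustration of the two loss choices, using PyTorch as an assumed framework (the note does not specify one):

```python
import torch
import torch.nn.functional as F

def deterministic_loss(pred_next_state, next_state):
    # Deterministic model f(s, a) -> s': squared error against the observed s'.
    return F.mse_loss(pred_next_state, next_state)

def stochastic_loss(pred_mean, pred_log_std, next_state):
    # Stochastic model p(s' | s, a) = N(mu, sigma): maximize log-likelihood,
    # i.e. minimize the negative log-probability of the observed s'.
    dist = torch.distributions.Normal(pred_mean, pred_log_std.exp())
    return -dist.log_prob(next_state).sum(dim=-1).mean()
```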
The world model serves not just to simulate the environment; it functions as a tool for determining optimal actions.
  • Learn an approximate model based on experiences
  • Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
  • Count outcomes s' for each s, a
  • Normalize to give an estimate of $\hat{T}(s, a, s')$, i.e. the empirical probability of each outcome s' given (s, a)
Step 2: Solve the learned MDP
Use value iteration, as before (see the sketch below).
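A tabular sketch of both steps, assuming small finite state and action sets and a `transitions` list of observed (s, a, r, s') tuples (variable names are illustrative):

```python
from collections import defaultdict

def learn_empirical_mdp(transitions):
    # Step 1: count outcomes s' for each (s, a), then normalize.
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    rewards = {}                                     # (s, a, s') -> observed reward
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s_next: c / total for s_next, c in outcomes.items()}
    return T_hat, rewards

def value_iteration(T_hat, rewards, states, actions, gamma=0.9, n_iters=100):
    # Step 2: solve the learned MDP with standard value iteration.
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {
            s: max(
                (sum(p * (rewards[(s, a, s2)] + gamma * V[s2])
                     for s2, p in T_hat[(s, a)].items())
                 for a in actions if (s, a) in T_hat),
                default=0.0,  # states with no observed actions keep value 0
            )
            for s in states
        }
    return V
```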

Different Ways to Interact with the Environment

Method performance is measured by expert-demonstration efficiency or by the amount of real-world interaction needed.
  • online: direct interaction with the environment
  • offline: using only previously collected data
  • model-based: using a learned virtual environment
Model based learnings
 

If we know the world model, how can we use it?

We should care about data-distribution mismatch, which leads to inaccurate model predictions and bad plans. There are ways to mitigate this, e.g. revisiting the problematic region (the cliff example) in the real environment so the model gets corrected, and collecting on-policy data iteratively.
  1. Model-based planning (see the MPC sketch below)
  2. Generating data
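For the planning case, a minimal random-shooting MPC sketch, where `model(s, a)` is the learned dynamics and `reward_fn(s, a)` is assumed known or learned (all names are illustrative):

```python
import numpy as np

def random_shooting_mpc(model, reward_fn, state, action_dim,
                        horizon=15, n_candidates=1000, action_scale=1.0):
    """Roll out many random action sequences inside the learned model and
    return the first action of the best-scoring sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-action_scale, action_scale,
                                    size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:            # imagined rollout inside the model
            total += reward_fn(s, a)
            s = model(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    # Execute only the first action, then replan at the next real step (MPC);
    # replanning against real observations helps with distribution mismatch.
    return best_first_action
```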
Model based learning Notion
 
 
Reinforcement Learning Implementations
 
 
State value function V(s): the expected sum of (discounted) rewards obtained starting from state s.
Action value function Q(s, a): the expected return from taking action a in state s and following the policy thereafter.
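Written out with the standard definitions (notation is mine, not from the note), for a policy $\pi$ and discount factor $\gamma$:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s \right]
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ a_{0} = a \right]
\qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi}(s, a) \right]
```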
 
 
 

Why we might want our network to predict state differences instead of directly predicting the next state

This is particularly advantageous when consecutive states are similar: the regression targets (s' - s) stay small and close to zero, which improves training stability.
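A sketch of a dynamics network that outputs the state difference and adds it back to the current state (PyTorch; the architecture and sizes are illustrative, not from the note):

```python
import torch
import torch.nn as nn

class DeltaDynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # The network predicts delta_s; the next-state prediction is s + delta_s,
        # so the regression targets (s' - s) stay small and near zero mean.
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta

# Training: MSE between the predicted and true next state, or equivalently
# between the predicted delta and (s' - s).
```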
Any AI agent capable of multi-step goal-directed tasks must possess an accurate internal World Model; this is shown via a constructive proof that such models can be extracted from agent policies with bounded error. The more an agent learns (the more experience it has), the better it becomes at solving "deeper" goals, and the more accurately we can reconstruct transition probabilities just by observing its policy. There is no "model-free" shortcut: the ability to achieve long-term and complex goals inherently requires learning an accurate world model. However, it is not necessary to explicitly define and train the world model; defining a good Next Token Prediction objective and an appropriate implicit AI Incentive is sufficient.
 
 

 

Recommendations