State transition probability, dynamics, world model (MBRL)
A learned model is useful for sample efficiency and is somewhat task-agnostic. On the other hand, models do not optimize task performance directly and are sometimes harder to learn than a policy. Note that "model" here does not mean the policy; it refers to a model that predicts the behavior of the environment (its dynamics).
Although MBRL is in theory off-policy (meaning it can learn from any data), in practice, it will perform poorly if you do not have on-policy data. In other words, if a model is trained on only randomly-collected data, it will (in most cases) be insufficient to describe the parts of the state space that we may actually care about. We can therefore use on-policy data collection in an iterative algorithm to improve overall task performance.
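Below is a minimal sketch of that iterative loop. The helper callables (`collect_rollout`, `train_model`, `plan_with_model`, `random_policy`) are hypothetical placeholders you would supply for your own environment, not any particular library's API.

```python
from typing import Callable, List, Tuple

Transition = Tuple[object, object, float, object]  # (state, action, reward, next_state)

def iterative_mbrl(
    collect_rollout: Callable[[Callable], List[Transition]],  # runs a policy in the real env
    train_model: Callable[[List[Transition]], object],        # fits p(s' | s, a) to a dataset
    plan_with_model: Callable[[object, object], object],      # picks an action using the model
    random_policy: Callable[[object], object],
    num_iterations: int = 10,
) -> Tuple[object, List[Transition]]:
    """Alternate between fitting a dynamics model and collecting on-policy data.

    The initial dataset is random, so the model is only accurate near random
    behavior; each planning rollout then adds data from the states the planner
    actually visits, shrinking the distribution mismatch.
    """
    dataset: List[Transition] = collect_rollout(random_policy)  # bootstrap with random data
    model = None
    for _ in range(num_iterations):
        model = train_model(dataset)                            # refit on everything seen so far
        policy = lambda state, m=model: plan_with_model(m, state)
        dataset += collect_rollout(policy)                      # append on-policy transitions
    return model, dataset
```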
Unlike Q-learning, which accounts for long-term returns, model learning only considers single-step rewards and transitions.
When learning the model, we use an MSE loss for deterministic dynamics and a log-likelihood loss for stochastic dynamics.
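For concreteness, here is a hedged illustration of both losses, using a Gaussian output head for the stochastic case (a common but not universal choice); the function and argument names are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def deterministic_loss(pred_next_state, true_next_state):
    # Deterministic model: predict s' directly and minimize mean squared error.
    return F.mse_loss(pred_next_state, true_next_state)

def gaussian_nll_loss(pred_mean, pred_log_var, true_next_state):
    # Stochastic model: predict a Gaussian over s' and minimize the negative
    # log-likelihood (equivalently, maximize log p(s' | s, a)), up to a constant.
    inv_var = torch.exp(-pred_log_var)
    return 0.5 * ((true_next_state - pred_mean) ** 2 * inv_var + pred_log_var).mean()
```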
The world model serves not just to simulate the environment; it also functions as a tool for determining optimal actions:
- Learn an approximate model from experience
- Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
- Count outcomes s' for each s, a
- Normalize to give an estimate of $\hat{T}(s, a, s')$
- This is simply the empirical probability of landing in s' after taking action a in state s
Step 2: Solve the learned MDP
- Use value iteration, as before (a sketch of both steps follows below)
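Here is a small, self-contained sketch of both steps for a tabular MDP: count transitions, normalize to get $\hat{T}(s, a, s')$ and an average reward estimate, then run value iteration on the estimated model. The discount factor, tolerance, and the (state, action, reward, next_state) sample format are assumptions made for this example.

```python
import numpy as np

def estimate_mdp(transitions, num_states, num_actions):
    """Step 1: build T_hat(s, a, s') and R_hat(s, a) from (s, a, r, s') samples."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sums = np.zeros((num_states, num_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
    visits = counts.sum(axis=2, keepdims=True)
    # Normalize counts into transition probabilities; unvisited (s, a) stay zero.
    T_hat = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
    R_hat = np.divide(reward_sums, visits[:, :, 0],
                      out=np.zeros_like(reward_sums), where=visits[:, :, 0] > 0)
    return T_hat, R_hat

def value_iteration(T_hat, R_hat, gamma=0.95, tol=1e-6):
    """Step 2: solve the learned MDP as if the estimates were correct."""
    num_states, num_actions, _ = T_hat.shape
    V = np.zeros(num_states)
    while True:
        Q = R_hat + gamma * (T_hat @ V)        # Q(s, a) under the estimated model
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # state values and greedy policy
        V = V_new
```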
Different Ways to Interact with the Environment
Method performance can be measured by expert-demonstration efficiency or by the amount of real-world interaction required.
- online: direct interaction with the environment
- offline: learning only from previously collected data
- model-based: interacting with a learned (virtual) environment
Model-Based Learning
If we know the world model, how can we use it?
We should pay attention to data-distribution mismatch, which causes inaccurate model predictions and therefore bad plans. There are ways to mitigate this, such as re-visiting problematic states (e.g., the cliff example) in the real environment so that the model gets corrected where it matters.
- Model-based planning (a random-shooting sketch follows below)
- Generating data with the learned model
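One common way to plan with a learned model is random shooting, a simple form of model-predictive control: sample candidate action sequences, roll each one forward through the model, and execute only the first action of the best sequence before replanning. The `model_step` and `reward_fn` callables below are hypothetical placeholders, as are the action bounds and hyperparameters.

```python
import numpy as np

def random_shooting_plan(model_step, reward_fn, state, action_dim,
                         horizon=10, num_candidates=100, rng=None):
    """Pick an action by simulating random action sequences through the learned model.

    model_step(state, action) -> predicted next state (the learned dynamics)
    reward_fn(state, action)  -> scalar reward estimate
    """
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(num_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))  # candidate sequence
        s, total = state, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = model_step(s, a)             # roll forward in the virtual environment
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action                 # MPC: execute the first action, then replan
```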
Reinforcement Learning Implementations
State value function V(s) - the expected (discounted) sum of rewards obtained starting from state s, over all future time steps
Action value function Q(s, a) - the expected (discounted) sum of rewards obtained after taking action a in state s
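In symbols (the standard definitions, assuming a policy $\pi$ and discount factor $\gamma$, which the notes above leave implicit):

```math
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s,\ a_{0}=a\right]
```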
Why might we want our network to predict state differences instead of directly predicting the next state?
This is particularly advantageous when consecutive states differ only slightly: the regression targets stay small and well-scaled, which improves training stability, and the next state is recovered as s' = s + Δs.
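A minimal sketch of the idea (the two-layer MLP and its sizes are hypothetical choices; the only essential part is returning `state + delta`):

```python
import torch
import torch.nn as nn

class DeltaDynamicsModel(nn.Module):
    """Predict the change in state, Δs, rather than the next state itself."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),  # output has the shape of a state difference
        )

    def forward(self, state, action):
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta                   # next state = current state + predicted Δs
```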
Any AI agent capable of multi-step, goal-directed tasks must possess an accurate internal world model; this is shown by a constructive proof that such a model can be extracted from the agent's policy with bounded error. The more an agent has learned (the more experience it has), the better it becomes at solving "deeper" goals, and the more accurately we can reconstruct the transition probabilities just by observing its policy. There is no "model-free" shortcut: the ability to achieve long-term, complex goals inherently requires learning an accurate world model. However, it is not necessary to explicitly define and train the world model; defining a good next-token-prediction objective and appropriate implicit incentives for the agent is sufficient.