GBP (Gradient-Based Planning)
- Roll out some policy (e.g. a random policy) for H steps with the model
- Backpropagate through the model with the objective (predicted return)
- Apply gradient ascent to the actions to improve them (see the sketch after this list)
- Version 1: Guess & check (random shooting)
Sample random action sequences and choose the one with the highest predicted return
- Version 2: CEM iteration — iteratively refit the sampling distribution to the top-scoring (elite) action sequences (sketched below)
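A minimal PyTorch sketch of the GBP loop above, assuming a simple learned dynamics model. All names here (`WorldModel`, `reward_fn`, `plan_gbp`) and the toy objective are illustrative stand-ins, not from the note:

```python
import torch
import torch.nn as nn

# Hypothetical learned world model: predicts the next state from (state, action)
class WorldModel(nn.Module):
    def __init__(self, state_dim=4, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def reward_fn(state, action):
    # Toy objective (assumption): stay near the origin with small actions
    return -(state.pow(2).sum(-1) + 0.1 * action.pow(2).sum(-1))

def plan_gbp(model, s0, horizon=10, iters=50, lr=0.1, action_dim=2):
    """Gradient-based planning: roll out through the differentiable model,
    backprop the return into the action sequence, and take ascent steps."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        s, ret = s0, 0.0
        for t in range(horizon):          # rollout for H steps with the model
            ret = ret + reward_fn(s, actions[t])
            s = model(s, actions[t])
        (-ret).backward()                 # minimizing -return == gradient ascent
        opt.step()
    return actions.detach()

model = WorldModel()
plan = plan_gbp(model, s0=torch.randn(4))
```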
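For contrast, a sketch of the sampling-based planners, reusing the hypothetical `model` and `reward_fn` from the sketch above. With `iters=1` and no refit this reduces to Version 1 (random shooting); the refit loop is Version 2 (CEM iteration):

```python
import torch

def plan_cem(model, s0, horizon=10, pop=64, n_elite=8, iters=5, action_dim=2):
    """CEM iteration: sample action sequences from a Gaussian, score them
    with batched model rollouts, then refit the Gaussian to the elites."""
    mu = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        cand = mu + std * torch.randn(pop, horizon, action_dim)  # sample population
        returns = torch.zeros(pop)
        with torch.no_grad():
            s = s0.expand(pop, -1)
            for t in range(horizon):                             # batched rollout
                returns += reward_fn(s, cand[:, t])
                s = model(s, cand[:, t])
        elite = cand[returns.topk(n_elite).indices]              # keep best sequences
        mu, std = elite.mean(0), elite.std(0) + 1e-6             # refit distribution
    return mu
```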
Gradient-based planning is weak because the planner's actions/trajectories deviate from the training distribution, so the world model fails there and returns poor gradients. By generating training data from the distribution/worst-case scenarios that GBP will actually encounter and fine-tuning the world model on it, GBP can reach CEM-iteration-level performance at a much lower cost.
Online World Modeling (OWM)
- Generate actions with GBP → roll out those actions in the actual simulator to get corrected states → retrain the world model on the corrected trajectory (sketched after this list)
- Effect: brings the OOD latent regions encountered during planning into the training distribution, reducing long-horizon error accumulation.
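A sketch of one OWM round, reusing the hypothetical `plan_gbp` from the GBP sketch above; `sim` is an assumed simulator object exposing a `step(state, action) -> next_state` method:

```python
import torch

def owm_round(model, sim, s0, optimizer, horizon=10):
    """Plan with GBP, correct the trajectory with the real simulator,
    then retrain the world model on the corrected transitions."""
    actions = plan_gbp(model, s0, horizon)           # 1) plan under current model
    states = [s0]
    for t in range(horizon):                         # 2) ground-truth states from sim
        states.append(sim.step(states[-1], actions[t]))
    loss = 0.0
    for t in range(horizon):                         # 3) fit model to real transitions
        pred = model(states[t], actions[t])
        loss = loss + (pred - states[t + 1]).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # OOD regions the planner visits become in-distribution
    return loss.item()
```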
Adversarial World Modeling (AWM)
- Apply FGSM adversarial perturbations to the expert data's states/actions (in the direction that maximizes model loss) and train on them (see the sketch below)
- Effect: smooths the world model's input gradients and the induced planning landscape, so GBP is less likely to get stuck in local minima/flat regions. (And this works without a simulator.)
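A sketch of one AWM training step on a batch of logged expert transitions, assuming the same world-model interface as above (`awm_step` and `eps` are illustrative names). The FGSM direction is the sign of the input gradient of the model loss:

```python
import torch

def awm_step(model, optimizer, states, actions, next_states, eps=0.05):
    """Perturb expert (state, action) inputs along the loss-maximizing
    FGSM direction, then train the world model on the perturbed batch."""
    s = states.clone().detach().requires_grad_(True)
    a = actions.clone().detach().requires_grad_(True)
    loss = (model(s, a) - next_states).pow(2).mean()
    grad_s, grad_a = torch.autograd.grad(loss, [s, a])
    s_adv = (states + eps * grad_s.sign()).detach()   # FGSM on states
    a_adv = (actions + eps * grad_a.sign()).detach()  # FGSM on actions
    adv_loss = (model(s_adv, a_adv) - next_states).pow(2).mean()
    optimizer.zero_grad()
    adv_loss.backward()          # train on adversarial inputs; no simulator needed
    optimizer.step()
    return adv_loss.item()
```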

Seonglae Cho