GBP (Gradient-Based Planning)
- Roll out some policy (e.g. a random policy) for H steps with the model
- Backpropagate through the model with the objective (predicted return)
- Apply gradient ascent to the actions to improve them (see the sketch after this list)
- Version 1: Guess & check (random shooting)
Sample random action sequences and choose the one with the highest predicted return
- Version 2: CEM iteration — iteratively refit the sampling distribution to the top-scoring (elite) action sequences (sketched below)
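A minimal PyTorch sketch of the GBP loop above, assuming a simple learned dynamics model. All names here (`WorldModel`, `reward_fn`, `plan_gbp`) and the toy objective are illustrative stand-ins, not from the note:

```python
import torch
import torch.nn as nn

# Hypothetical learned world model: predicts the next state from (state, action)
class WorldModel(nn.Module):
    def __init__(self, state_dim=4, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def reward_fn(state, action):
    # Toy objective (assumption): stay near the origin with small actions
    return -(state.pow(2).sum(-1) + 0.1 * action.pow(2).sum(-1))

def plan_gbp(model, s0, horizon=10, iters=50, lr=0.1, action_dim=2):
    """Gradient-based planning: roll out through the differentiable model,
    backprop the return into the action sequence, and take ascent steps."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        s, ret = s0, 0.0
        for t in range(horizon):          # rollout for H steps with the model
            ret = ret + reward_fn(s, actions[t])
            s = model(s, actions[t])
        (-ret).backward()                 # minimizing -return == gradient ascent
        opt.step()
    return actions.detach()

model = WorldModel()
plan = plan_gbp(model, s0=torch.randn(4))
```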
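For contrast, a sketch of the sampling-based planners, reusing the hypothetical `model` and `reward_fn` from the sketch above. With `iters=1` and no refit this reduces to Version 1 (random shooting); the refit loop is Version 2 (CEM iteration):

```python
import torch

def plan_cem(model, s0, horizon=10, pop=64, n_elite=8, iters=5, action_dim=2):
    """CEM iteration: sample action sequences from a Gaussian, score them
    with batched model rollouts, then refit the Gaussian to the elites."""
    mu = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        cand = mu + std * torch.randn(pop, horizon, action_dim)  # sample population
        returns = torch.zeros(pop)
        with torch.no_grad():
            s = s0.expand(pop, -1)
            for t in range(horizon):                             # batched rollout
                returns += reward_fn(s, cand[:, t])
                s = model(s, cand[:, t])
        elite = cand[returns.topk(n_elite).indices]              # keep best sequences
        mu, std = elite.mean(0), elite.std(0) + 1e-6             # refit distribution
    return mu
```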
Gradient-based planning is weak because the planner's actions/trajectories deviate from the training distribution, so the world model fails there and returns poor gradients. By generating training data from the distribution/worst-case scenarios that GBP will actually encounter and fine-tuning the world model on it, GBP can reach CEM-iteration-level performance at a much lower cost.
Online World Modeling (OWM)
- Generate actions with GBP → roll out those actions in the actual simulator to get corrected states → retrain the world model on the corrected trajectory (sketched after this list)
- Effect: brings the OOD latent regions encountered during planning into the training distribution, reducing long-horizon error accumulation.
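A sketch of one OWM round, reusing the hypothetical `plan_gbp` from the GBP sketch above; `sim` is an assumed simulator object exposing a `step(state, action) -> next_state` method:

```python
import torch

def owm_round(model, sim, s0, optimizer, horizon=10):
    """Plan with GBP, correct the trajectory with the real simulator,
    then retrain the world model on the corrected transitions."""
    actions = plan_gbp(model, s0, horizon)           # 1) plan under current model
    states = [s0]
    for t in range(horizon):                         # 2) ground-truth states from sim
        states.append(sim.step(states[-1], actions[t]))
    loss = 0.0
    for t in range(horizon):                         # 3) fit model to real transitions
        pred = model(states[t], actions[t])
        loss = loss + (pred - states[t + 1]).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # OOD regions the planner visits become in-distribution
    return loss.item()
```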
Adversarial World Modeling (AWM)
- Apply FGSM adversarial perturbations to the expert data's states/actions (in the direction that maximizes model loss) and train on them (see the sketch below)
- Effect: smooths the world model's input gradients and the induced planning landscape, so GBP is less likely to get stuck in local minima/flat regions. (And this works without a simulator.)
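A sketch of one AWM training step on a batch of logged expert transitions, assuming the same world-model interface as above (`awm_step` and `eps` are illustrative names). The FGSM direction is the sign of the input gradient of the model loss:

```python
import torch

def awm_step(model, optimizer, states, actions, next_states, eps=0.05):
    """Perturb expert (state, action) inputs along the loss-maximizing
    FGSM direction, then train the world model on the perturbed batch."""
    s = states.clone().detach().requires_grad_(True)
    a = actions.clone().detach().requires_grad_(True)
    loss = (model(s, a) - next_states).pow(2).mean()
    grad_s, grad_a = torch.autograd.grad(loss, [s, a])
    s_adv = (states + eps * grad_s.sign()).detach()   # FGSM on states
    a_adv = (actions + eps * grad_a.sign()).detach()  # FGSM on actions
    adv_loss = (model(s_adv, a_adv) - next_states).pow(2).mean()
    optimizer.zero_grad()
    adv_loss.backward()          # train on adversarial inputs; no simulator needed
    optimizer.step()
    return adv_loss.item()
```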

Seonglae Cho