Sampling-based optimization
Planning (search): sample candidate plans with the model and pick the best one
Generating (imaginary) training data (model rollouts) via sampling (gradient-free)
- Sample H-step action sequences with the model
- Remember the best action sequence (a minimal sketch follows below)
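A minimal sketch of this sample-and-pick-best loop, assuming a learned one-step dynamics model `model_step(s, a) -> s'` and a reward function `reward_fn(s, a)`; both names and all hyperparameters are illustrative placeholders, not from the lecture.

```python
import numpy as np

def plan_random_shooting(s0, model_step, reward_fn,
                         horizon=10, n_samples=1000, act_dim=2):
    """Sample random H-step action sequences, score them in the model, keep the best."""
    best_return, best_actions = -np.inf, None
    for _ in range(n_samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, act_dim))
        s, total = s0, 0.0
        for a in actions:            # imagined rollout, no gradients needed
            total += reward_fn(s, a)
            s = model_step(s, a)
        if total > best_return:      # remember the best action sequence
            best_return, best_actions = total, actions
    return best_actions
```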
Gradient-based learning
- Roll out some policy (e.g., a random policy) for H steps with the model
- Backpropagate the objective through the model
- Take gradient-ascent steps to improve the plan or policy (a sketch follows below)
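For contrast with the gradient-free version, a minimal JAX sketch of backpropagating through the model. The linear dynamics, goal, and step sizes are made-up toy assumptions; here the action sequence itself is improved by gradient ascent, which is one common variant.

```python
import jax
import jax.numpy as jnp

A = jnp.eye(2)                       # toy differentiable dynamics: s' = A s + 0.1 a
goal = jnp.array([1.0, 1.0])

def rollout_return(actions, s0):
    """Unroll the model for H steps and sum rewards; gradients flow through every step."""
    s, total = s0, 0.0
    for a in actions:
        s = A @ s + 0.1 * a
        total += -jnp.sum((s - goal) ** 2)   # reward = negative distance to goal
    return total

grad_fn = jax.grad(rollout_return)           # d(return)/d(actions) via backprop

actions = jnp.zeros((10, 2))                 # initial H = 10 plan
s0 = jnp.zeros(2)
for _ in range(200):                         # gradient ascent on the action sequence
    actions = actions + 1e-2 * grad_fn(actions, s0)
```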
- Version 1: Guess & check (random shooting)
Sample random action sequences, score each one under the model, and choose the best (as sketched above)
- Version 2: CEM iteration
Refit the sampling distribution (e.g., a Gaussian) to the best-scoring (elite) sequences and resample, repeating for a few iterations (sketched below)
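A rough CEM sketch under the same assumed `model_step` / `reward_fn` interface: fit a Gaussian over action sequences to the elite samples and resample; all hyperparameters are illustrative.

```python
import numpy as np

def plan_cem(s0, model_step, reward_fn, horizon=10, act_dim=2,
             n_samples=500, n_elite=50, n_iters=5):
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        plans = mean + std * np.random.randn(n_samples, horizon, act_dim)
        returns = np.empty(n_samples)
        for i, actions in enumerate(plans):          # score each plan in the model
            s, total = s0, 0.0
            for a in actions:
                total += reward_fn(s, a)
                s = model_step(s, a)
            returns[i] = total
        elites = plans[np.argsort(returns)[-n_elite:]]
        mean = elites.mean(axis=0)                   # refit the sampling distribution
        std = elites.std(axis=0) + 1e-6
    return mean                                      # refined action sequence
```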
Can the model plan at an abstract level for long-horizon tasks?
This is only practical for short-horizon problems or heavily shaped reward functions: making long plans is too computationally expensive, and the model is not accurate over long horizons.
Generative RL
What transfers across environments and tasks? Text-to-video generation (instruction → imagined states → actions) as a universal planner: no explicit states, actions, rewards, or task information required, just images.
- Train a video diffusion model + temporal super-resolution
UniPi
Universal Policy
How can we execute the plan?
Train an inverse dynamics model for each robot, since each robot's dynamics differ; here the state is an image.
We usually don't have a real-world third-person point of view, but the hope is that training on all kinds of viewpoint images helps the model generalize (depth images are not used for training in this paper, though).
UniPi can synthesize a diverse set of behaviors that satisfy language instructions, which is a plausible way to extend decision making (a rough execution-loop sketch follows below).
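A rough sketch of the execution loop described above; `generate_video_plan`, `inverse_dynamics`, and `env` are hypothetical stand-ins, not the paper's actual API.

```python
def execute_instruction(instruction, first_frame, generate_video_plan,
                        inverse_dynamics, env):
    """Text-to-video planning: imagine a frame sequence, then recover actions per robot."""
    frames = generate_video_plan(instruction, first_frame)  # video diffusion "plan"
    obs = first_frame
    for next_frame in frames[1:]:
        # robot-specific inverse dynamics: (image_t, image_{t+1}) -> action
        action = inverse_dynamics(obs, next_frame)
        obs = env.step(action)                              # current camera image
    return obs
```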
UniSim
Universal simulator
- Con: computationally heavy
- Pro: can provide web-scale knowledge, like LLMs, by leveraging abundant data in various forms