Sampling-based optimization
Planning (search): sample candidate plans with the model and pick the best one
Generating (imaginary) training data (model rollouts) via sampling (gradient-free)
- Sample H-step action sequences with the model
- Remember the best action sequence (a minimal sketch follows below)
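A minimal sketch of this sample-and-pick-best loop, assuming a learned one-step dynamics model `model_step(s, a) -> s'` and a reward function `reward_fn(s, a)`; both names and all hyperparameters are illustrative placeholders, not from the lecture.

```python
import numpy as np

def plan_random_shooting(s0, model_step, reward_fn,
                         horizon=10, n_samples=1000, act_dim=2):
    """Sample random H-step action sequences, score them in the model, keep the best."""
    best_return, best_actions = -np.inf, None
    for _ in range(n_samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, act_dim))
        s, total = s0, 0.0
        for a in actions:            # imagined rollout, no gradients needed
            total += reward_fn(s, a)
            s = model_step(s, a)
        if total > best_return:      # remember the best action sequence
            best_return, best_actions = total, actions
    return best_actions
```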
Gradient-based learning
- Roll out some policy (e.g., a random policy) for H steps with the model
- Backpropagate the objective through the model
- Take gradient-ascent steps to improve the plan or policy (a sketch follows below)
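For contrast with the gradient-free version, a minimal JAX sketch of backpropagating through the model. The linear dynamics, goal, and step sizes are made-up toy assumptions; here the action sequence itself is improved by gradient ascent, which is one common variant.

```python
import jax
import jax.numpy as jnp

A = jnp.eye(2)                       # toy differentiable dynamics: s' = A s + 0.1 a
goal = jnp.array([1.0, 1.0])

def rollout_return(actions, s0):
    """Unroll the model for H steps and sum rewards; gradients flow through every step."""
    s, total = s0, 0.0
    for a in actions:
        s = A @ s + 0.1 * a
        total += -jnp.sum((s - goal) ** 2)   # reward = negative distance to goal
    return total

grad_fn = jax.grad(rollout_return)           # d(return)/d(actions) via backprop

actions = jnp.zeros((10, 2))                 # initial H = 10 plan
s0 = jnp.zeros(2)
for _ in range(200):                         # gradient ascent on the action sequence
    actions = actions + 1e-2 * grad_fn(actions, s0)
```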
- Version 1: Guess & check (random shooting)
Sample random action sequences, score each one under the model, and choose the best (as sketched above)
- Version 2: CEM iteration
Refit the sampling distribution (e.g., a Gaussian) to the best-scoring (elite) sequences and resample, repeating for a few iterations (sketched below)
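A rough CEM sketch under the same assumed `model_step` / `reward_fn` interface: fit a Gaussian over action sequences to the elite samples and resample; all hyperparameters are illustrative.

```python
import numpy as np

def plan_cem(s0, model_step, reward_fn, horizon=10, act_dim=2,
             n_samples=500, n_elite=50, n_iters=5):
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        plans = mean + std * np.random.randn(n_samples, horizon, act_dim)
        returns = np.empty(n_samples)
        for i, actions in enumerate(plans):          # score each plan in the model
            s, total = s0, 0.0
            for a in actions:
                total += reward_fn(s, a)
                s = model_step(s, a)
            returns[i] = total
        elites = plans[np.argsort(returns)[-n_elite:]]
        mean = elites.mean(axis=0)                   # refit the sampling distribution
        std = elites.std(axis=0) + 1e-6
    return mean                                      # refined action sequence
```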
Can the model plan at an abstract level for long-horizon tasks?
This is only practical for short-horizon problems or heavily shaped reward functions: making long plans is too computationally expensive, and the model is not accurate over long horizons.
Generative RL
What transfers across environments and tasks? Text-to-video generation (instruction → imagined states → actions) as a universal planner: no explicit states, actions, rewards, or task information required, just images.
- Train a video diffusion model + temporal super-resolution
UniPi
Universal Policy
How can we execute the plan?
Train an inverse dynamics model for each robot, since each robot's dynamics differ; here the state is an image.
We usually don't have a real-world third-person point of view, but the hope is that training on all kinds of viewpoint images helps the model generalize (depth images are not used for training in this paper, though).
UniPi can synthesize a diverse set of behaviors that satisfy language instructions, which is a plausible way to extend decision making (a rough execution-loop sketch follows below).
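A rough sketch of the execution loop described above; `generate_video_plan`, `inverse_dynamics`, and `env` are hypothetical stand-ins, not the paper's actual API.

```python
def execute_instruction(instruction, first_frame, generate_video_plan,
                        inverse_dynamics, env):
    """Text-to-video planning: imagine a frame sequence, then recover actions per robot."""
    frames = generate_video_plan(instruction, first_frame)  # video diffusion "plan"
    obs = first_frame
    for next_frame in frames[1:]:
        # robot-specific inverse dynamics: (image_t, image_{t+1}) -> action
        action = inverse_dynamics(obs, next_frame)
        obs = env.step(action)                              # current camera image
    return obs
```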
UniSim
Universal simulator
- Con: computationally heavy
- Pro: can provide web-scale knowledge, like LLMs, by leveraging abundant data in various forms