Action-conditioned world model
Large-scale human video is abundant, but action labels are scarce. This world model is pretrained on 44k hours of egocentric human video, sidestepping the missing labels with continuous latent actions: 32-d embeddings extracted from frame-to-frame motion by a VAE, which serve as unified proxy actions attached to every video for training. The backbone is NVIDIA Cosmos Predict 2.5, a latent video diffusion model trained with flow matching. For robot control it predicts relative actions (per-step changes rather than absolute joint positions). Adapting to a new embodiment (e.g., GR-1, G1, AgiBot) requires only resetting and fine-tuning the action MLP on a small amount of target robot data.
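
A minimal PyTorch sketch of how such a latent-action VAE could look: the encoder compresses a consecutive frame pair into a 32-d latent, and the decoder must reconstruct the next frame from the current frame plus that latent, so the latent is forced to encode the motion between frames. Only the 32-d latent size and the frame-pair input come from the note; every architectural detail and name below is a hypothetical illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class LatentActionVAE(nn.Module):
    """Hypothetical latent-action VAE: frame pair -> 32-d proxy action."""

    def __init__(self, action_dim: int = 32):
        super().__init__()
        # Encoder sees two frames stacked on the channel axis (3 + 3 = 6)
        # and summarizes their relative motion.
        self.pair_encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_mu = nn.Linear(64, action_dim)
        self.to_logvar = nn.Linear(64, action_dim)
        # Separate context encoder for the current frame alone.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Decoder predicts the next frame from current-frame context + latent,
        # so reconstruction succeeds only if the latent captures the motion.
        self.decoder = nn.Sequential(
            nn.Linear(64 + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 64 * 64),
        )

    def forward(self, frame_t, frame_t1):
        h = self.pair_encoder(torch.cat([frame_t, frame_t1], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        ctx = self.frame_encoder(frame_t)
        recon = self.decoder(torch.cat([ctx, z], dim=1)).view(-1, 3, 64, 64)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        return recon, kl, z  # z is the proxy action attached to the clip
```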
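The flow-matching objective itself is standard: interpolate linearly between noise and clean video latents, and regress the constant velocity along that path. This is the generic recipe (rectified-flow style), not Cosmos Predict 2.5's exact formulation; the model signature is assumed.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, latents, action_cond):
    """latents: clean video latents (B, ...); action_cond: latent actions."""
    noise = torch.randn_like(latents)
    # One timestep per sample, broadcastable over the latent dimensions.
    t = torch.rand(latents.shape[0], *[1] * (latents.dim() - 1),
                   device=latents.device)
    x_t = (1.0 - t) * noise + t * latents  # linear interpolation path
    target_v = latents - noise             # constant velocity along the path
    pred_v = model(x_t, t.flatten(), action_cond)
    return F.mse_loss(pred_v, target_v)
```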
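The relative-action representation amounts to training on per-step deltas and integrating them back onto the robot's current state at control time; a small sketch of that conversion:

```python
import numpy as np

def to_relative_actions(joint_traj: np.ndarray) -> np.ndarray:
    """Convert absolute joint positions (T, D) to per-step deltas (T-1, D).

    Deltas describe motion rather than pose, which transfers more easily
    across embodiments with different rest poses and joint conventions.
    """
    return np.diff(joint_traj, axis=0)

def apply_relative_action(current_joints: np.ndarray, delta: np.ndarray):
    # The controller integrates each predicted delta onto the current state.
    return current_joints + delta
```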
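Embodiment adaptation then reduces to freezing the pretrained backbone and re-initializing only the action head before fine-tuning on the small target-robot dataset. A hedged sketch, assuming the head maps 32-d latent actions to the robot's action space (the attribute name `action_mlp` and all dimensions are placeholders):

```python
import torch.nn as nn

def prepare_for_new_embodiment(model: nn.Module, robot_action_dim: int):
    # Keep the pretrained video backbone fixed; low-data fine-tuning
    # only touches the freshly reset action head.
    for p in model.parameters():
        p.requires_grad = False
    model.action_mlp = nn.Sequential(      # fresh head: 32-d latent action
        nn.Linear(32, 256), nn.GELU(),     # -> target robot's action space
        nn.Linear(256, robot_action_dim),
    )
    # New parameters are trainable by default; pass them to the optimizer.
    return list(model.action_mlp.parameters())
```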
https://arxiv.org/pdf/2602.06949

Seonglae Cho