DreamDojo

Creator
Seonglae Cho
Created
2026 Feb 10 15:45
Edited
2026 Feb 10 15:52
Refs

Action-conditioned world model

Large-scale human video is abundant, but action labels are scarce. DreamDojo is pretrained on 44k hours of egocentric human video and addresses the missing labels with continuous latent actions: 32-d embeddings extracted from frame-to-frame motion by a VAE, which serve as unified proxy actions for all training videos. The model is built on NVIDIA Cosmos Predict 2.5, a latent video diffusion model trained with flow matching. For robot control it uses relative actions (frame-to-frame changes rather than absolute joint positions), and it adapts to new embodiments (e.g., GR-1, G1, AgiBot) by resetting and fine-tuning only the action MLP on small amounts of target-robot data.
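A toy numpy sketch of the latent-action idea: treat the pixel difference between consecutive frames as the motion signal and project it through a VAE-style encoder to a 32-d latent via the reparameterization trick. The projection matrices and frame sizes here are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 32  # dimensionality of the latent action, per the note

def encode_latent_action(frame_t, frame_t1, W_mu, W_logvar):
    """Map frame-to-frame motion to a continuous latent action.

    Toy stand-in for the VAE encoder: the motion signal is the pixel
    difference between consecutive frames, linearly projected to a 32-d
    mean/log-variance, then sampled via the reparameterization trick.
    """
    motion = (frame_t1 - frame_t).reshape(-1)   # frame-to-frame motion
    mu = W_mu @ motion                          # (32,)
    logvar = W_logvar @ motion                  # (32,)
    eps = rng.standard_normal(LATENT_DIM)
    return mu + np.exp(0.5 * logvar) * eps      # z ~ N(mu, sigma^2)

# Toy 8x8 grayscale frames and randomly initialized projections.
D = 8 * 8
frame_t  = rng.random((8, 8))
frame_t1 = rng.random((8, 8))
W_mu     = rng.standard_normal((LATENT_DIM, D)) * 0.01
W_logvar = rng.standard_normal((LATENT_DIM, D)) * 0.01

z = encode_latent_action(frame_t, frame_t1, W_mu, W_logvar)
print(z.shape)  # 32-d proxy action attached to this frame pair
```

The same encoder runs over every unlabeled video, so all pretraining clips get a unified action representation without any robot-specific labels.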
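The flow-matching objective behind the latent video diffusion can be sketched in a few lines: interpolate between a noise sample and the clean latent along a straight-line path, and regress the constant velocity along that path. Shapes and the linear path choice are illustrative assumptions, not Cosmos Predict internals.

```python
import numpy as np

rng = np.random.default_rng(1)

def flow_matching_pair(x1, t):
    """Linear-path flow matching: given a clean latent x1 and noise
    x0 ~ N(0, I), build the interpolated sample x_t and the velocity
    target the denoising network should regress at time t."""
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    x_t = (1.0 - t) * x0 + t * x1        # straight-line interpolation
    v_target = x1 - x0                   # constant velocity along the path
    return x_t, v_target

x1 = rng.standard_normal((4, 16))  # a toy batch of clean video latents
x_t, v = flow_matching_pair(x1, t=0.3)
print(x_t.shape, v.shape)          # network input and regression target
```

Training minimizes the mean squared error between the network's predicted velocity at (x_t, t) and v_target; the latent action embedding enters as conditioning.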
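Embodiment adaptation can be illustrated the same way: convert absolute joint trajectories to relative (delta) actions, then re-initialize only the action MLP while keeping the pretrained backbone, before fine-tuning on the small target-robot dataset. The dict-of-arrays "model" here is a hypothetical stand-in for the real network.

```python
import numpy as np

def to_relative(joint_traj):
    """Convert absolute joint positions to relative actions:
    each action is the change from the previous timestep."""
    return np.diff(joint_traj, axis=0)

# Hypothetical model: a frozen pretrained backbone plus an action MLP.
model = {
    "backbone": np.ones((8, 8)),     # pretrained video-model weights (kept)
    "action_mlp": np.ones((8, 32)),  # action-conditioning MLP (reset)
}

def reset_action_mlp(model, rng):
    """Re-initialize only the action MLP before fine-tuning on a new
    embodiment's small dataset; the pretrained backbone is untouched."""
    model["action_mlp"] = 0.01 * rng.standard_normal(model["action_mlp"].shape)
    return model

rng = np.random.default_rng(2)
traj = np.cumsum(rng.standard_normal((5, 7)), axis=0)  # 5 timesteps, 7 joints
rel = to_relative(traj)                                # (4, 7) delta actions
model = reset_action_mlp(model, rng)
print(rel.shape)
```

Relative actions keep the action distribution comparable across robots with different joint conventions, which is what makes the small-data MLP reset sufficient for a new embodiment.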
 
 
 
arxiv.org