Action-conditioned world model
Large-scale human video is abundant, but action labels are scarce. This world model is pretrained on 44k hours of egocentric human video, sidestepping the missing labels with continuous latent actions: 32-d embeddings extracted from frame-to-frame motion by a VAE, which serve as unified proxy actions attached to every video for training. The backbone is NVIDIA Cosmos Predict 2.5, a latent video diffusion model trained with flow matching. For robot control it predicts relative actions (per-step changes rather than absolute joint positions). Adapting to a new embodiment (e.g., GR-1, G1, AgiBot) requires only resetting and fine-tuning the action MLP on a small amount of target robot data.
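
A minimal PyTorch sketch of how such a latent-action VAE could look: the encoder compresses a consecutive frame pair into a 32-d latent, and the decoder must reconstruct the next frame from the current frame plus that latent, so the latent is forced to encode the motion between frames. Only the 32-d latent size and the frame-pair input come from the note; every architectural detail and name below is a hypothetical illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class LatentActionVAE(nn.Module):
    """Hypothetical latent-action VAE: frame pair -> 32-d proxy action."""

    def __init__(self, action_dim: int = 32):
        super().__init__()
        # Encoder sees two frames stacked on the channel axis (3 + 3 = 6)
        # and summarizes their relative motion.
        self.pair_encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_mu = nn.Linear(64, action_dim)
        self.to_logvar = nn.Linear(64, action_dim)
        # Separate context encoder for the current frame alone.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Decoder predicts the next frame from current-frame context + latent,
        # so reconstruction succeeds only if the latent captures the motion.
        self.decoder = nn.Sequential(
            nn.Linear(64 + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 64 * 64),
        )

    def forward(self, frame_t, frame_t1):
        h = self.pair_encoder(torch.cat([frame_t, frame_t1], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        ctx = self.frame_encoder(frame_t)
        recon = self.decoder(torch.cat([ctx, z], dim=1)).view(-1, 3, 64, 64)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        return recon, kl, z  # z is the proxy action attached to the clip
```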
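The flow-matching objective itself is standard: interpolate linearly between noise and clean video latents, and regress the constant velocity along that path. This is the generic recipe (rectified-flow style), not Cosmos Predict 2.5's exact formulation; the model signature is assumed.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, latents, action_cond):
    """latents: clean video latents (B, ...); action_cond: latent actions."""
    noise = torch.randn_like(latents)
    # One timestep per sample, broadcastable over the latent dimensions.
    t = torch.rand(latents.shape[0], *[1] * (latents.dim() - 1),
                   device=latents.device)
    x_t = (1.0 - t) * noise + t * latents  # linear interpolation path
    target_v = latents - noise             # constant velocity along the path
    pred_v = model(x_t, t.flatten(), action_cond)
    return F.mse_loss(pred_v, target_v)
```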
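The relative-action representation amounts to training on per-step deltas and integrating them back onto the robot's current state at control time; a small sketch of that conversion:

```python
import numpy as np

def to_relative_actions(joint_traj: np.ndarray) -> np.ndarray:
    """Convert absolute joint positions (T, D) to per-step deltas (T-1, D).

    Deltas describe motion rather than pose, which transfers more easily
    across embodiments with different rest poses and joint conventions.
    """
    return np.diff(joint_traj, axis=0)

def apply_relative_action(current_joints: np.ndarray, delta: np.ndarray):
    # The controller integrates each predicted delta onto the current state.
    return current_joints + delta
```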
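Embodiment adaptation then reduces to freezing the pretrained backbone and re-initializing only the action head before fine-tuning on the small target-robot dataset. A hedged sketch, assuming the head maps 32-d latent actions to the robot's action space (the attribute name `action_mlp` and all dimensions are placeholders):

```python
import torch.nn as nn

def prepare_for_new_embodiment(model: nn.Module, robot_action_dim: int):
    # Keep the pretrained video backbone fixed; low-data fine-tuning
    # only touches the freshly reset action head.
    for p in model.parameters():
        p.requires_grad = False
    model.action_mlp = nn.Sequential(      # fresh head: 32-d latent action
        nn.Linear(32, 256), nn.GELU(),     # -> target robot's action space
        nn.Linear(256, robot_action_dim),
    )
    # New parameters are trainable by default; pass them to the optimizer.
    return list(model.action_mlp.parameters())
```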
https://arxiv.org/pdf/2602.06949

Seonglae Cho