Latent Action World Model
A Latent Action Model (LAM) can be trained on in-the-wild videos without action labels, enabling prediction, transfer, and planning of diverse real-world actions
- IDM (inverse dynamics model): infers the latent action by observing future frames
- Forward/world model: predicts the next frame from the current frame and the inferred latent action
→ Training both jointly yields a Latent Action Space (see the sketch below)
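A minimal sketch of this two-model loop, assuming precomputed frame embeddings rather than raw pixels; the module names, sizes, and the plain MSE objective are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class IDM(nn.Module):
    """Inverse dynamics: infer latent action z_t from (o_t, o_{t+1})."""
    def __init__(self, obs_dim=512, act_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, o_t, o_next):
        return self.net(torch.cat([o_t, o_next], dim=-1))

class ForwardModel(nn.Module):
    """World model: predict o_{t+1} from (o_t, z_t)."""
    def __init__(self, obs_dim=512, act_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim))

    def forward(self, o_t, z_t):
        return self.net(torch.cat([o_t, z_t], dim=-1))

idm, wm = IDM(), ForwardModel()
opt = torch.optim.Adam([*idm.parameters(), *wm.parameters()], lr=3e-4)

# One training step on consecutive frames: no action labels anywhere.
o_t, o_next = torch.randn(32, 512), torch.randn(32, 512)
z = idm(o_t, o_next)                               # latent action from the future frame
loss = nn.functional.mse_loss(wm(o_t, z), o_next)  # forward model reconstructs it
opt.zero_grad(); loss.backward(); opt.step()
```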
Key findings: discrete (VQ) latents cannot capture complex in-the-wild actions well, and the learned latent actions converge to "camera-relative (local, spatially localized) transformations" rather than a "universal embodiment" representation.
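For intuition on the VQ finding, a toy vector-quantization bottleneck (straight-through estimator) makes the hard information ceiling of a discrete codebook concrete; the codebook size and dimensions here are made up:

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Toy vector quantizer (illustrative codebook size, not the paper's)."""
    def __init__(self, num_codes=64, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        d = torch.cdist(z, self.codebook.weight)   # distance to every code
        idx = d.argmin(dim=-1)                     # nearest-code assignment
        q = self.codebook(idx)
        return z + (q - z).detach(), idx           # straight-through estimator

z = torch.randn(8, 16)
q, idx = VQBottleneck()(z)
# With 64 codes, at most log2(64) = 6 bits of action information pass through
# per step; one intuition for why discrete latents underfit in-the-wild motion.
```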
Key insight: latent actions can be used as a "universal interface". By training a small controller that maps real actions to latent actions, CEM planning performance on DROID (robot manipulation) and RECON (navigation) reaches near the level of action-labeled world model baselines (close, though not the absolute best).
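A hedged sketch of that controller-plus-CEM loop; `wm` and `controller` are untrained stand-ins for the pretrained forward model and the small learned mapping (e.g. from 7-DoF commands to latent actions), and all planning hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

wm_head = nn.Linear(512 + 16, 512)
def wm(o, z):                           # stand-in forward/world model
    return wm_head(torch.cat([o, z], dim=-1))

controller = nn.Linear(7, 16)           # real action -> latent action

@torch.no_grad()
def cem_plan(o0, goal, horizon=5, pop=256, elites=32, iters=4, act_dim=7):
    mu, std = torch.zeros(horizon, act_dim), torch.ones(horizon, act_dim)
    for _ in range(iters):
        acts = mu + std * torch.randn(pop, horizon, act_dim)  # sample candidate plans
        o = o0.expand(pop, -1)
        for t in range(horizon):                              # roll out in latent space
            o = wm(o, controller(acts[:, t]))
        score = -(o - goal).pow(2).sum(-1)                    # closeness to goal state
        elite = acts[score.topk(elites).indices]              # keep the best plans
        mu, std = elite.mean(0), elite.std(0) + 1e-6          # refit the sampler
    return mu[0]                                              # first action only (MPC-style)

a0 = cem_plan(o0=torch.randn(1, 512), goal=torch.randn(1, 512))
```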
Key risk: the latent can "cheat" by encoding the next frame itself, so information regularization is critical.
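One common way to realize such a regularizer (the paper's exact penalty may differ) is a variational information bottleneck: treat the IDM output as a Gaussian posterior and penalize its KL to a standard normal prior, bounding how many bits of o_{t+1} can leak into z:

```python
import torch

def sample_latent(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()

mu, logvar = torch.randn(32, 16), torch.randn(32, 16)
z = sample_latent(mu, logvar)
# total_loss = recon_loss + beta * kl_to_standard_normal(mu, logvar)
# If beta is too small, z can encode o_{t+1} wholesale and the world model
# degenerates into a copy machine; the KL term forces z to stay an "action".
```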

Seonglae Cho