LAM

latent action world model

A Latent Action Model (LAM) can be trained on in-the-wild videos without action labels, enabling prediction, transfer, and planning of diverse real-world actions. It consists of two parts:
  • IDM (inverse dynamics model): infers the latent action by observing the current and future frames
  • Forward/world model: predicts the next frame from the current frame and the inferred latent action

→ Together, these two models yield a latent action space (minimal training sketch below)
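A minimal PyTorch sketch of this two-part setup, assuming simple MLPs over frame embeddings; the real model would be a larger video architecture, and `obs_dim`, `act_dim`, and the widths here are illustrative:

```python
import torch
import torch.nn as nn

class IDM(nn.Module):
    """Inverse dynamics model: (o_t, o_{t+1}) -> latent action z_t."""
    def __init__(self, obs_dim: int = 256, act_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, 512), nn.ReLU(),
            nn.Linear(512, act_dim),
        )

    def forward(self, o_t, o_next):
        return self.net(torch.cat([o_t, o_next], dim=-1))

class ForwardModel(nn.Module):
    """World model: (o_t, z_t) -> predicted o_{t+1}."""
    def __init__(self, obs_dim: int = 256, act_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 512), nn.ReLU(),
            nn.Linear(512, obs_dim),
        )

    def forward(self, o_t, z_t):
        return self.net(torch.cat([o_t, z_t], dim=-1))

# Training needs only consecutive frames -- no action labels.
idm, wm = IDM(), ForwardModel()
opt = torch.optim.Adam([*idm.parameters(), *wm.parameters()], lr=1e-4)
o_t, o_next = torch.randn(32, 256), torch.randn(32, 256)  # frame embeddings
z_t = idm(o_t, o_next)                        # infer latent action from the pair
loss = (wm(o_t, z_t) - o_next).pow(2).mean()  # next-frame prediction loss
opt.zero_grad(); loss.backward(); opt.step()
```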

Key finding: discrete (VQ) latents cannot capture complex in-the-wild actions well.
Key finding: learned latent actions converge to “camera-relative (local, spatially localized) transformations” rather than a “universal embodiment” action space.
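For reference, the discrete bottleneck in question is a standard VQ-style quantizer (a generic sketch, not the paper's exact module): each continuous latent is snapped to its nearest entry in a small codebook, which caps how many distinct actions can be represented.

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Generic VQ quantizer: snap each latent to its nearest codebook vector."""
    def __init__(self, num_codes: int = 8, act_dim: int = 16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, act_dim)

    def forward(self, z):
        dists = torch.cdist(z, self.codebook.weight)  # (batch, num_codes)
        z_q = self.codebook(dists.argmin(dim=-1))     # nearest code per latent
        # Straight-through estimator: quantized forward pass, identity gradient.
        return z + (z_q - z).detach()
```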
Key insight: latent actions can be used as a “universal interface”. By training a small controller to map real actions → latent actions, planning (CEM) performance on DROID (robot manipulation) / RECON (navigation) approaches that of action-labeled world-model baselines (close, though not the absolute best).
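A minimal CEM planning sketch under assumed interfaces: `controller` (real action → latent action), `wm` (the forward model above), and `cost` (scores a predicted state against a goal) are hypothetical callables, and the shapes and hyperparameters are illustrative.

```python
import torch

@torch.no_grad()
def cem_plan(o0, goal, controller, wm, cost,
             horizon=10, act_dim=7, pop=256, elites=32, iters=5):
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        # Sample candidate real-action sequences: (pop, horizon, act_dim).
        seqs = mean + std * torch.randn(pop, horizon, act_dim)
        o = o0.expand(pop, -1)
        for t in range(horizon):
            z = controller(seqs[:, t])  # real action -> latent action
            o = wm(o, z)                # roll out the latent-action world model
        scores = cost(o, goal)          # lower is better, shape (pop,)
        elite = seqs[scores.topk(elites, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0)  # refit the Gaussian
    return mean[0]  # execute the first planned action, MPC-style
```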
Key risk: the latent can “cheat” by simply encoding the next frame itself → information regularization on the latent is critical.
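One standard form such a regularizer can take (a sketch assuming a Gaussian IDM head; the actual bottleneck may differ): a KL penalty toward a standard normal limits how many bits the latent carries, so it cannot smuggle in the whole next frame.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dims."""
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()

# total_loss = prediction_loss + beta * kl_to_standard_normal(mu, logvar)
# beta trades prediction accuracy against latent information capacity.
```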
 
 
 
