Latent Action World Model
A Latent Action Model (LAM) can be trained on in-the-wild videos without action labels, enabling prediction, transfer, and planning of diverse real-world actions
- IDM (inverse dynamics model): infers the latent action by observing future frames
- Forward/world model: predicts the next frame from the current frame and the inferred latent action
→ Training both jointly yields a Latent Action Space (see the sketch below)
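A minimal sketch of this two-model loop, assuming precomputed frame embeddings rather than raw pixels; the module names, sizes, and the plain MSE objective are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class IDM(nn.Module):
    """Inverse dynamics: infer latent action z_t from (o_t, o_{t+1})."""
    def __init__(self, obs_dim=512, act_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, o_t, o_next):
        return self.net(torch.cat([o_t, o_next], dim=-1))

class ForwardModel(nn.Module):
    """World model: predict o_{t+1} from (o_t, z_t)."""
    def __init__(self, obs_dim=512, act_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim))

    def forward(self, o_t, z_t):
        return self.net(torch.cat([o_t, z_t], dim=-1))

idm, wm = IDM(), ForwardModel()
opt = torch.optim.Adam([*idm.parameters(), *wm.parameters()], lr=3e-4)

# One training step on consecutive frames: no action labels anywhere.
o_t, o_next = torch.randn(32, 512), torch.randn(32, 512)
z = idm(o_t, o_next)                               # latent action from the future frame
loss = nn.functional.mse_loss(wm(o_t, z), o_next)  # forward model reconstructs it
opt.zero_grad(); loss.backward(); opt.step()
```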
Key findings: discrete (VQ) latents cannot capture complex in-the-wild actions well, and the learned latent actions converge to "camera-relative (local, spatially localized) transformations" rather than a "universal embodiment" representation.
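For intuition on the VQ finding, a toy vector-quantization bottleneck (straight-through estimator) makes the hard information ceiling of a discrete codebook concrete; the codebook size and dimensions here are made up:

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Toy vector quantizer (illustrative codebook size, not the paper's)."""
    def __init__(self, num_codes=64, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        d = torch.cdist(z, self.codebook.weight)   # distance to every code
        idx = d.argmin(dim=-1)                     # nearest-code assignment
        q = self.codebook(idx)
        return z + (q - z).detach(), idx           # straight-through estimator

z = torch.randn(8, 16)
q, idx = VQBottleneck()(z)
# With 64 codes, at most log2(64) = 6 bits of action information pass through
# per step; one intuition for why discrete latents underfit in-the-wild motion.
```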
Key insight: latent actions can be used as a "universal interface". By training a small controller that maps real actions to latent actions, CEM planning performance on DROID (robot manipulation) and RECON (navigation) reaches near the level of action-labeled world model baselines (close, though not the absolute best).
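A hedged sketch of that controller-plus-CEM loop; `wm` and `controller` are untrained stand-ins for the pretrained forward model and the small learned mapping (e.g. from 7-DoF commands to latent actions), and all planning hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

wm_head = nn.Linear(512 + 16, 512)
def wm(o, z):                           # stand-in forward/world model
    return wm_head(torch.cat([o, z], dim=-1))

controller = nn.Linear(7, 16)           # real action -> latent action

@torch.no_grad()
def cem_plan(o0, goal, horizon=5, pop=256, elites=32, iters=4, act_dim=7):
    mu, std = torch.zeros(horizon, act_dim), torch.ones(horizon, act_dim)
    for _ in range(iters):
        acts = mu + std * torch.randn(pop, horizon, act_dim)  # sample candidate plans
        o = o0.expand(pop, -1)
        for t in range(horizon):                              # roll out in latent space
            o = wm(o, controller(acts[:, t]))
        score = -(o - goal).pow(2).sum(-1)                    # closeness to goal state
        elite = acts[score.topk(elites).indices]              # keep the best plans
        mu, std = elite.mean(0), elite.std(0) + 1e-6          # refit the sampler
    return mu[0]                                              # first action only (MPC-style)

a0 = cem_plan(o0=torch.randn(1, 512), goal=torch.randn(1, 512))
```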
Key risk: the latent can "cheat" by encoding the next frame itself, so information regularization is critical.
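One common way to realize such a regularizer (the paper's exact penalty may differ) is a variational information bottleneck: treat the IDM output as a Gaussian posterior and penalize its KL to a standard normal prior, bounding how many bits of o_{t+1} can leak into z:

```python
import torch

def sample_latent(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()

mu, logvar = torch.randn(32, 16), torch.randn(32, 16)
z = sample_latent(mu, logvar)
# total_loss = recon_loss + beta * kl_to_standard_normal(mu, logvar)
# If beta is too small, z can encode o_{t+1} wholesale and the world model
# degenerates into a copy machine; the KL term forces z to stay an "action".
```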

Seonglae Cho