Don't use the future as input; only use it as a target.
If you want action-relevant representations, predict the future state, not future pixels, and never feed the future into the predictor.
Latent-action models that pretrain VLAs on human videos often overfit to pixel changes (appearance, background, lighting, camera shake) rather than action semantics. When the training pipeline feeds future frames into the model, latent actions can collapse into an information-leakage shortcut that directly encodes the future. A JEPA-style approach avoids this by performing future-state prediction/alignment in latent space instead of pixel reconstruction.
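A minimal numpy sketch of the distinction, with hypothetical linear encoders and dimensions (real systems use learned deep networks and an EMA/stop-gradient target encoder): the latent-action encoder may see the frame pair during training, but the predictor receives only the current frame plus the latent action, and the future frame appears only on the target side of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_LAT, D_ACT = 64, 16, 4  # hypothetical dimensions

# Hypothetical linear maps standing in for learned networks
W_enc = rng.normal(size=(D_OBS, D_LAT)) * 0.1         # online encoder
W_tgt = W_enc.copy()                                   # target encoder (EMA copy in practice)
W_act = rng.normal(size=(D_OBS * 2, D_ACT)) * 0.1      # latent-action encoder
W_pred = rng.normal(size=(D_LAT + D_ACT, D_LAT)) * 0.1 # predictor

def latent_action(o_t, o_tp1):
    # The action encoder sees the frame pair to infer a compact latent action;
    # the narrow bottleneck (D_ACT) limits how much of the future it can leak.
    return np.concatenate([o_t, o_tp1]) @ W_act

def predict_future_latent(o_t, a):
    # The predictor sees only the current frame's latent and the latent action,
    # never the future frame itself.
    z_t = o_t @ W_enc
    return np.concatenate([z_t, a]) @ W_pred

def jepa_loss(o_t, o_tp1):
    a = latent_action(o_t, o_tp1)
    z_pred = predict_future_latent(o_t, a)
    z_target = o_tp1 @ W_tgt  # future used ONLY as a target (stop-gradient in practice)
    return float(np.mean((z_pred - z_target) ** 2))

o_t = rng.normal(size=D_OBS)
o_tp1 = rng.normal(size=D_OBS)
print(jepa_loss(o_t, o_tp1))
```

Because the loss is computed between latent vectors rather than pixels, the encoder is free to discard appearance details that are irrelevant to the action.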
https://arxiv.org/pdf/2602.10098

Seonglae Cho