Don't use the future as input; only use it as a target.
If you want action-relevant representations, predict the future state, not future pixels, and never feed the future into the predictor.
Latent-action models that pretrain VLAs on human videos often overfit to pixel changes (appearance, background, lighting, camera shake) rather than action semantics. When the training pipeline feeds future frames into the model, latent actions can collapse into an information-leakage shortcut that directly encodes the future. A JEPA-style approach avoids this by performing future-state prediction/alignment in latent space instead of pixel reconstruction.
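A minimal numpy sketch of the distinction, with hypothetical linear encoders and dimensions (real systems use learned deep networks and an EMA/stop-gradient target encoder): the latent-action encoder may see the frame pair during training, but the predictor receives only the current frame plus the latent action, and the future frame appears only on the target side of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_LAT, D_ACT = 64, 16, 4  # hypothetical dimensions

# Hypothetical linear maps standing in for learned networks
W_enc = rng.normal(size=(D_OBS, D_LAT)) * 0.1         # online encoder
W_tgt = W_enc.copy()                                   # target encoder (EMA copy in practice)
W_act = rng.normal(size=(D_OBS * 2, D_ACT)) * 0.1      # latent-action encoder
W_pred = rng.normal(size=(D_LAT + D_ACT, D_LAT)) * 0.1 # predictor

def latent_action(o_t, o_tp1):
    # The action encoder sees the frame pair to infer a compact latent action;
    # the narrow bottleneck (D_ACT) limits how much of the future it can leak.
    return np.concatenate([o_t, o_tp1]) @ W_act

def predict_future_latent(o_t, a):
    # The predictor sees only the current frame's latent and the latent action,
    # never the future frame itself.
    z_t = o_t @ W_enc
    return np.concatenate([z_t, a]) @ W_pred

def jepa_loss(o_t, o_tp1):
    a = latent_action(o_t, o_tp1)
    z_pred = predict_future_latent(o_t, a)
    z_target = o_tp1 @ W_tgt  # future used ONLY as a target (stop-gradient in practice)
    return float(np.mean((z_pred - z_target) ** 2))

o_t = rng.normal(size=D_OBS)
o_tp1 = rng.normal(size=D_OBS)
print(jepa_loss(o_t, o_tp1))
```

Because the loss is computed between latent vectors rather than pixels, the encoder is free to discard appearance details that are irrelevant to the action.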
https://arxiv.org/pdf/2602.10098

Seonglae Cho