VLA-JEPA

Creator: Seonglae Cho
Created: 2026 Feb 13 19:1
Edited: 2026 Feb 13 19:9
Refs: VLA

Don't use the future as input; only use it as a target.

If you want action-relevant representations, predict the future state, not future pixels, and never feed the future into the predictor.
Latent-action models that pretrain VLA on human videos often overfit to pixel changes (appearance, background, lighting, camera shake) rather than action semantics. When the training pipeline feeds future frames into the model, the latent actions can collapse into an information-leakage shortcut that simply encodes the future directly. A JEPA-style approach avoids this by performing future-state prediction/alignment in latent space instead of pixel reconstruction, with the future frame used only as a prediction target.
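
A minimal PyTorch sketch of the idea (illustrative, not the paper's actual code; the `Encoder`/`Predictor` architectures, loss, and EMA momentum are assumptions): the predictor sees only the current latent and the action, while the future frame is encoded by a frozen EMA target encoder and used purely as a stop-gradient regression target.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an observation to a latent state (architecture is a placeholder)."""
    def __init__(self, obs_dim=512, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))
    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the future latent from current latent + action.
    Crucially, the future observation is never an input here."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 512),
                                 nn.ReLU(), nn.Linear(512, latent_dim))
    def forward(self, z_t, a_t):
        return self.net(torch.cat([z_t, a_t], dim=-1))

encoder = Encoder()
predictor = Predictor()
# EMA target encoder: frozen copy, so the future enters only as a target.
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

def jepa_loss(obs_t, action_t, obs_t1):
    z_t = encoder(obs_t)                 # current state -> latent
    z_hat = predictor(z_t, action_t)     # predicted future latent
    with torch.no_grad():                # stop-gradient: future is a target only
        z_t1 = target_encoder(obs_t1)
    # Align predicted and target latents in latent space, not pixel space.
    return F.smooth_l1_loss(z_hat, z_t1)

@torch.no_grad()
def ema_update(m=0.996):
    """Slowly track the online encoder; a common anti-collapse choice in JEPA-style setups."""
    for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
        tp.mul_(m).add_((1 - m) * p)
```

Because gradients never flow through the future frame, the latent action cannot degenerate into a shortcut that copies the future; it has to carry whatever information actually predicts the state transition.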
arxiv.org
