C-JEPA
C-JEPA (Causal Joint Embedding Predictive Architecture) extends JEPA to object-centric representations, using object-level masking as a latent intervention. The key idea is to mask the slots of selected objects across the entire history window while preserving only an identity anchor at the earliest timestep.
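The masking scheme above can be sketched as follows; the slot tensor layout `(T, K, D)` and the zero mask-token placeholder are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mask_object_slots(slots, obj_ids, mask_token):
    """Object-level masking: for each selected object, replace its slot
    at every timestep except the earliest (t=0), which is kept as an
    identity anchor.

    slots:      (T, K, D) array of per-frame slot vectors
    obj_ids:    indices of objects (slot positions) to mask
    mask_token: (D,) placeholder inserted at masked positions
    """
    masked = slots.copy()
    for k in obj_ids:
        masked[1:, k] = mask_token  # t=0 identity anchor survives
    return masked

T, K, D = 6, 4, 8
slots = np.random.randn(T, K, D)
out = mask_object_slots(slots, obj_ids=[2], mask_token=np.zeros(D))
```

Because the anchor at t=0 is the only surviving evidence of the masked object, the predictor must infer its later states from interactions with the unmasked objects.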
Masked tokens are defined as $\tilde{z}_t = W m + p_t$, where $W$ is a linear projection (applied to a mask embedding $m$) and $p_t$ is the temporal positional encoding. The training objective is the masked latent prediction loss $\mathcal{L} = \mathcal{L}_{\text{hist}} + \mathcal{L}_{\text{fut}}$, which decomposes into a history reconstruction loss $\mathcal{L}_{\text{hist}}$ and a future prediction loss $\mathcal{L}_{\text{fut}}$.
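A minimal sketch of the masked latent prediction objective; the mean-squared-error form and the explicit (t, k) index sets for masked-history and future positions are assumptions about the concrete implementation.

```python
import numpy as np

def masked_latent_loss(pred, target, hist_mask_idx, future_idx):
    """L = L_hist + L_fut: squared error between predicted and target
    slots, computed only at masked history positions and at future
    positions (prediction stays purely in latent space).

    pred, target:   (T, K, D) predicted / encoder-target slots
    hist_mask_idx:  list of (t, k) masked positions in the history
    future_idx:     list of (t, k) positions in the future window
    """
    def mse_at(idx):
        if not idx:
            return 0.0
        diffs = [np.mean((pred[t, k] - target[t, k]) ** 2) for t, k in idx]
        return float(np.mean(diffs))

    l_hist = mse_at(hist_mask_idx)
    l_fut = mse_at(future_idx)
    return l_hist + l_fut, l_hist, l_fut
```

The targets come from the frozen object-centric encoder, so no pixel-space reconstruction term appears anywhere in the objective.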
Object-level masking imposes counterfactual-like queries during training, preventing shortcut solutions such as trivial self-dynamics or temporal interpolation and making interaction reasoning essential for minimizing the prediction objective. The predictor is a ViT-style transformer with bidirectional attention; auxiliary variables such as actions and proprioception are conditioned on as separate entities.
C-JEPA converts each frame into slot representations using a frozen object-centric encoder (VideoSAUR or SAVi); a masked transformer predictor then performs bidirectional joint inference over both history and future. It leverages strong semantic priors from frozen DINOv2-based features and predicts purely in latent space, with no reconstruction loss. At inference time, only forward prediction is performed from the fully observed history, without masking. A limitation is that performance depends heavily on the quality of the object-centric encoder.
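The inference-time difference can be illustrated with a toy sketch: the history slots are passed as fully observed context (no masking), and only future query positions are predicted. The predictor is stubbed out here, and the query-token layout is an assumption.

```python
import numpy as np

def predict_future(history_slots, n_future, predictor, future_query):
    """Forward-only inference: concatenate observed history slots with
    future query tokens, run one bidirectional predictor pass, and
    return only the future portion of the output.

    history_slots: (T_h, K, D) fully observed history (nothing masked)
    n_future:      number of future timesteps to predict
    predictor:     callable mapping (T_h + n_future, K, D) -> same shape
    future_query:  (D,) placeholder token for unobserved future slots
    """
    T_h, K, D = history_slots.shape
    queries = np.broadcast_to(future_query, (n_future, K, D))
    tokens = np.concatenate([history_slots, queries], axis=0)
    out = predictor(tokens)
    return out[T_h:]  # keep predicted future slots only

# Identity "predictor" stub, just to exercise the shapes.
hist = np.random.randn(4, 3, 8)
future = predict_future(hist, n_future=2, predictor=lambda x: x,
                        future_query=np.zeros(8))
```

This mirrors training except that the masked-history branch of the loss is never exercised: every history token is real.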
Causal-JEPA: Learning World Models through Object-Level Latent...
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to...
https://arxiv.org/abs/2602.11389


Seonglae Cho