Prior JEPA methods are often unstable because they rely on complex polynomial losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to prevent representation collapse. This paper introduces LeWorldModel, the first JEPA that can be trained stably end-to-end from raw pixels. It reduces the number of key hyperparameters from six to just one.
The core of LeWM is a training objective with two loss terms. First, the next-embedding prediction loss trains a predictor to estimate the next latent state from the current latent state and action. Second, SIGReg (Sketched-Isotropic-Gaussian Regularizer) enforces that latent embeddings follow an isotropic Gaussian distribution to prevent collapse. SIGReg projects embeddings onto $M$ random unit-norm directions and applies the Epps–Pulley normality test to each 1D projection, penalizing the weighted squared distance between the projection's empirical characteristic function $\hat{\varphi}_n(t)$ and that of a standard Gaussian: $T = n \int_{-\infty}^{\infty} \left| \hat{\varphi}_n(t) - e^{-t^2/2} \right|^2 w(t)\,dt$. The final objective combines the two terms as $\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\,\mathcal{L}_{\text{SIGReg}}$, using the default $M = 1024$ projection directions.
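A minimal numpy sketch of the SIGReg idea, assuming the standard Epps–Pulley form (empirical characteristic function vs. the Gaussian CF $e^{-t^2/2}$, weighted by a standard-normal density and averaged over random projections). Function names, the grid size, and the small default `num_dirs` are illustrative, not the paper's API; the paper's default is $M = 1024$ directions.

```python
import numpy as np

def sigreg_epps_pulley(z, num_dirs=64, num_t=17, seed=0):
    """Illustrative SIGReg penalty: project embeddings z (n, d) onto
    random unit directions and measure each 1D projection's deviation
    from N(0, 1) via an Epps-Pulley-style characteristic-function
    discrepancy. Returns a scalar (0 for a perfect standard Gaussian)."""
    rng = np.random.default_rng(seed)
    n, d = z.shape
    # M random unit-norm projection directions (columns)
    u = rng.standard_normal((d, num_dirs))
    u /= np.linalg.norm(u, axis=0, keepdims=True)
    p = z @ u                                   # (n, M) 1D projections
    t = np.linspace(-3.0, 3.0, num_t)           # evaluation grid for t
    # empirical characteristic function: ecf[k, m] = mean_i exp(i t_k p[i, m])
    phase = t[:, None, None] * p[None, :, :]    # (T, n, M)
    ecf = np.exp(1j * phase).mean(axis=1)       # (T, M)
    target = np.exp(-t**2 / 2)[:, None]         # CF of N(0, 1), (T, 1)
    w = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)  # standard-normal weight
    sq_err = np.abs(ecf - target) ** 2 * w[:, None]
    # average the weighted squared discrepancy over the grid and directions
    return float(sq_err.mean())
```

Collapsed embeddings (e.g., all points mapped to a constant) have ECF $\equiv 1$, far from $e^{-t^2/2}$, so the penalty is large; genuinely Gaussian embeddings score near zero, which is what makes this term an anti-collapse regularizer.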
By the Cramér–Wold theorem, matching all 1D projected marginal distributions is theoretically equivalent to matching the full joint distribution, which justifies SIGReg's random-projection sketch. LeWM uses no stop-gradient, EMA, or additional stabilization heuristics. The only substantive hyperparameter, $\lambda$, can be tuned via bisection search, which reaches tolerance $\epsilon$ in $\mathcal{O}(\log(1/\epsilon))$ iterations. The model has ~15M parameters, and at inference time it performs planning in latent space using Cross-Entropy Method (CEM)-based Model Predictive Control (MPC).
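The CEM-based MPC loop can be sketched as follows. This is a generic latent-space CEM planner, not the paper's implementation: `dynamics`, `encoder`, the goal-distance cost, and all sizes are illustrative assumptions.

```python
import numpy as np

def cem_plan(dynamics, encoder, obs, goal_z, horizon=5, act_dim=2,
             pop=256, elites=32, iters=4, seed=0):
    """Illustrative CEM-based MPC in latent space: sample action
    sequences from a Gaussian, roll them out through the latent
    dynamics, score by distance to a goal embedding, and refit the
    sampling Gaussian to the lowest-cost elites. Returns the first
    action of the best plan (MPC executes one step, then replans)."""
    rng = np.random.default_rng(seed)
    z0 = encoder(obs)
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        # candidate action sequences: (pop, horizon, act_dim)
        acts = mu + sigma * rng.standard_normal((pop, horizon, act_dim))
        z = np.repeat(z0[None, :], pop, axis=0)
        cost = np.zeros(pop)
        for t in range(horizon):
            z = dynamics(z, acts[:, t])          # latent rollout step
            cost += np.linalg.norm(z - goal_z[None, :], axis=1)
        # refit the sampling distribution to the lowest-cost elites
        idx = np.argsort(cost)[:elites]
        mu = acts[idx].mean(axis=0)
        sigma = acts[idx].std(axis=0) + 1e-6
    return mu[0]  # MPC: execute only the first planned action
```

With a toy identity encoder and additive dynamics `z' = z + a`, the planner's first action points from the current latent state toward the goal embedding, which is the qualitative behavior expected of planning in a well-structured latent space.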
LeWorldModel: Stable End-to-End Joint-Embedding Predictive...
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term...
https://arxiv.org/abs/2603.19312


Seonglae Cho