NEPA

Creator
Seonglae Cho
Created
2026 Jan 1 11:58
Edited
2026 Jan 10 0:15
Refs
JEPA

Next-Embedding Prediction

Images are converted into a sequence of patch embeddings, and the model autoregressively predicts the next embedding. Strong visual representations are learned from this single embedding-prediction objective, with no pixel reconstruction, decoder, contrastive objective, or input masking. After pretraining a ViT on ImageNet-1K, it achieves SOTA-level performance on classification and segmentation. The key components are causal attention masking, next-embedding prediction, and a stop-gradient on the targets, which keep the method simple yet scalable. Mirroring next-token prediction in language models, this suggests the possibility of unified autoregressive pretraining across modalities.
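A minimal sketch of the objective in PyTorch, assuming a standard causal Transformer encoder over patch embeddings; the MSE loss and the names here (NextEmbeddingPredictor, nepa_loss) are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    """Causal Transformer that maps a prefix of embeddings to the next one."""
    def __init__(self, embed_dim: int = 768, depth: int = 12, heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal attention mask: position i attends only to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(
            x.size(1), device=x.device
        )
        return self.encoder(x, mask=mask, is_causal=True)

def nepa_loss(model: NextEmbeddingPredictor, emb: torch.Tensor) -> torch.Tensor:
    # emb: (batch, num_patches, embed_dim) continuous patch embeddings.
    pred = model(emb[:, :-1])      # predict each next embedding from the prefix
    target = emb[:, 1:].detach()   # stop-gradient: targets carry no gradient
    return F.mse_loss(pred, target)

model = NextEmbeddingPredictor()
loss = nepa_loss(model, torch.randn(2, 196, 768))  # 14x14 patches at dim 768
loss.backward()
```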
Pixels map directly to continuous-valued patch embeddings, and the model predicts the next embedding. With patch size p×p, a shallow embedding layer projects each patch, then positional embeddings are added. Pretraining cost is comparable to existing approaches.
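This front end is essentially the usual ViT patchify step, sketched below with a strided convolution and a learned positional table; the shapes and both of those design choices are assumptions standing in for the paper's shallow embedding layer:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """p x p patchify: one conv with kernel = stride = p, plus positions."""
    def __init__(self, img_size: int = 224, patch_size: int = 16,
                 in_chans: int = 3, embed_dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, num_patches, embed_dim)
        x = self.proj(pixels).flatten(2).transpose(1, 2)
        return x + self.pos_embed

patches = PatchEmbed()(torch.randn(1, 3, 224, 224))  # (1, 196, 768)
```

Feeding these embeddings into the predictor sketched above gives the full pixels-to-loss pipeline.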