Next-Embedding Prediction
Images are converted into a sequence of patch embeddings, and each next embedding is predicted autoregressively. Strong visual representations emerge from this single embedding-prediction objective, with no pixel reconstruction, decoder, contrastive learning, or masking. Pretrained on ImageNet-1K with a ViT backbone, the approach reaches SOTA-level performance on classification and segmentation. Its key components are causal masking, next-embedding prediction, and a stop-gradient on the target, making it simple yet scalable. Like next-token prediction in language models, this points toward unified autoregressive pretraining across modalities.
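A minimal sketch of the objective under these stated ingredients: a causal Transformer over patch embeddings, trained to regress the next embedding against a stop-gradient target. The class name, model width, and the choice of MSE as the regression loss are illustrative assumptions, not taken from the source.

```python
# Sketch only: causal masking + next-embedding prediction + stop-gradient.
# d_model / n_heads / n_layers and the MSE loss are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):  # x: (B, N, d_model) sequence of patch embeddings
        n = x.size(1)
        # Causal mask: position i attends only to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(n, device=x.device)
        return self.encoder(x, mask=mask)

def next_embedding_loss(model, emb):
    """emb: (B, N, d) patch embeddings from the embedding layer."""
    pred = model(emb)[:, :-1]      # output at position i predicts position i+1
    target = emb[:, 1:].detach()   # stop-gradient: targets carry no gradient
    return F.mse_loss(pred, target)
```

Because the targets are the model's own input embeddings with gradients detached, no pixel decoder or separate target network is required, which is what keeps the recipe a single objective.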
Pixels are mapped directly to continuous-valued patch embeddings, and the model predicts the next embedding in the sequence. With patch size p×p, a shallow embedding layer is applied and positional embeddings are added. Pretraining cost is reported as comparable to existing pretraining approaches. A sketch of this front end follows below.
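The patch-embedding front end described above, sketched under common ViT conventions: a p×p patchify implemented as a strided convolution (one shallow embedding layer) plus learned positional embeddings. Image size, patch size, and width are assumptions for illustration.

```python
# Sketch of the patch-embedding step: pixels -> continuous patch embeddings.
# img_size=224, patch=16, d_model=768 are assumed ViT-style defaults.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, d_model=768):
        super().__init__()
        # A single strided conv acts as the shallow embedding layer.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))

    def forward(self, img):               # img: (B, 3, H, W)
        x = self.proj(img)                # (B, d, H/p, W/p) patch grid
        x = x.flatten(2).transpose(1, 2)  # (B, N, d) patch sequence
        return x + self.pos               # add positional embeddings
```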

Seonglae Cho