Next-Embedding Prediction
Images are converted into a sequence of patch embeddings, and each next embedding is predicted autoregressively. Strong visual representations emerge from this single embedding-prediction objective, with no pixel reconstruction, decoder, contrastive learning, or masking. Pretrained on ImageNet-1K with a ViT backbone, the approach reaches SOTA-level performance on classification and segmentation. Its key components are causal masking, next-embedding prediction, and a stop-gradient on the target, making it simple yet scalable. Like next-token prediction in language models, this points toward unified autoregressive pretraining across modalities.
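A minimal sketch of the objective under these stated ingredients: a causal Transformer over patch embeddings, trained to regress the next embedding against a stop-gradient target. The class name, model width, and the choice of MSE as the regression loss are illustrative assumptions, not taken from the source.

```python
# Sketch only: causal masking + next-embedding prediction + stop-gradient.
# d_model / n_heads / n_layers and the MSE loss are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):  # x: (B, N, d_model) sequence of patch embeddings
        n = x.size(1)
        # Causal mask: position i attends only to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(n, device=x.device)
        return self.encoder(x, mask=mask)

def next_embedding_loss(model, emb):
    """emb: (B, N, d) patch embeddings from the embedding layer."""
    pred = model(emb)[:, :-1]      # output at position i predicts position i+1
    target = emb[:, 1:].detach()   # stop-gradient: targets carry no gradient
    return F.mse_loss(pred, target)
```

Because the targets are the model's own input embeddings with gradients detached, no pixel decoder or separate target network is required, which is what keeps the recipe a single objective.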
Pixels are mapped directly to continuous-valued patch embeddings, and the model predicts the next embedding in the sequence. With patch size p×p, a shallow embedding layer is applied and positional embeddings are added. Pretraining cost is reported as comparable to existing pretraining approaches. A sketch of this front end follows below.
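The patch-embedding front end described above, sketched under common ViT conventions: a p×p patchify implemented as a strided convolution (one shallow embedding layer) plus learned positional embeddings. Image size, patch size, and width are assumptions for illustration.

```python
# Sketch of the patch-embedding step: pixels -> continuous patch embeddings.
# img_size=224, patch=16, d_model=768 are assumed ViT-style defaults.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, d_model=768):
        super().__init__()
        # A single strided conv acts as the shallow embedding layer.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))

    def forward(self, img):               # img: (B, 3, H, W)
        x = self.proj(img)                # (B, d, H/p, W/p) patch grid
        x = x.flatten(2).transpose(1, 2)  # (B, N, d) patch sequence
        return x + self.pos               # add positional embeddings
```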

Seonglae Cho