Just image Transformer

JiT

z_t is the noisy image (intermediate state) at time t, defined as a mix of the clean image x and Gaussian noise ϵ. In traditional diffusion, the model directly predicts ϵ or v (noise or velocity). The patch-based ViT (JiT), however, directly predicts the clean image x. When needed, that prediction is converted to v so that the same diffusion update can be performed.
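The conversion from an x-prediction to v can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes a linear mix z_t = t·x + (1 − t)·ϵ with t = 0 as pure noise and t = 1 as the clean image, so that v = x − ϵ. The function names (`make_noisy`, `x_to_v`, `euler_step`) and the `t_floor` clamp are hypothetical, and the values are plain floats standing in for image tensors.

```python
def make_noisy(x, eps, t):
    """Assumed schedule: z_t = t * x + (1 - t) * eps (t=0 noise, t=1 clean)."""
    return t * x + (1.0 - t) * eps


def x_to_v(x_pred, z_t, t, t_floor=1e-5):
    """Turn a clean-image prediction into a velocity.

    Under the assumed schedule, x - z_t = (1 - t) * (x - eps), so
    v = (x_pred - z_t) / (1 - t). The denominator is clamped (t_floor is
    an illustrative choice) to avoid dividing by zero as t approaches 1.
    """
    return (x_pred - z_t) / max(1.0 - t, t_floor)


def euler_step(z_t, v, dt):
    """One Euler step of dz/dt = v, moving from time t to t + dt."""
    return z_t + dt * v


# Toy check with scalars and an oracle that predicts x exactly.
x, eps = 0.8, -1.5
t, dt = 0.3, 0.1
z_t = make_noisy(x, eps, t)
v = x_to_v(x, z_t, t)           # equals x - eps when the prediction is exact
z_next = euler_step(z_t, v, dt)  # matches make_noisy(x, eps, t + dt)
```

With an exact x-prediction the step lands on the noisy state for time t + dt, which is why an x-predicting network can reuse the usual velocity-based sampler unchanged.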

According to the paper's claims and experiments, ϵ/v-prediction tends to become unstable with large patches (which increase the dimensionality of each token), whereas x-prediction is more stable, which lets a plain ViT work.
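To see how patch size drives token dimensionality, here is a small arithmetic sketch. It assumes square, non-overlapping patches on an RGB image; the helper names and the example sizes are illustrative and not taken from the paper.

```python
def patch_token_dim(patch_size, channels=3):
    """Raw values per token: one flattened patch of patch_size x patch_size pixels."""
    return patch_size * patch_size * channels


def num_tokens(image_size, patch_size):
    """Number of non-overlapping patches in a square image."""
    return (image_size // patch_size) ** 2


dims = {p: patch_token_dim(p) for p in (4, 16, 32)}
# 4 -> 48, 16 -> 768, 32 -> 3072 values per token
tokens_256_p16 = num_tokens(256, 16)  # 256 tokens for a 256x256 image
```

Doubling the patch side quadruples the per-token dimensionality, so the prediction target for each token grows quickly as patches get larger.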

https://arxiv.org/pdf/2511.13720v1