Just image Transformer

Creator
Seonglae Cho
Created
2025 Dec 16 17:53
Edited
2025 Dec 16 19:07

JiT

z_t is the noisy image (intermediate state) at time t, defined as a mixture of the clean image x and Gaussian noise ϵ. In traditional diffusion, the model directly predicts ϵ or v (noise or velocity). JiT, a plain ViT operating on patches, instead directly predicts the clean image x. When needed, this prediction is converted to v so that the same diffusion update can be performed.
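The x-to-v conversion can be sketched in a few lines. This is only an illustration under an assumed linear schedule, z_t = t·x + (1 − t)·ϵ with v = x − ϵ (a common flow-matching convention; the note itself does not state the schedule). The function names, the `t_eps` clamp, and the `predict_x` callback standing in for the network are all hypothetical.

```python
import random


def make_noisy(x, eps, t):
    """Assumed linear mix: z_t = t * x + (1 - t) * eps (t=0 pure noise, t=1 clean)."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]


def x_to_v(x_pred, z_t, t, t_eps=1e-5):
    """Turn a clean-image prediction into a velocity.

    Under the assumed schedule, v = x - eps and z_t = t*x + (1-t)*eps,
    so v = (x - z_t) / (1 - t). The clamp avoids dividing by zero near t = 1.
    """
    denom = max(1.0 - t, t_eps)
    return [(xp - zi) / denom for xp, zi in zip(x_pred, z_t)]


def euler_step(z_t, v, dt):
    """One Euler update of the sample along the velocity."""
    return [zi + dt * vi for zi, vi in zip(z_t, v)]


def sample(predict_x, eps, num_steps=50):
    """Integrate from noise (t=0) to data (t=1) using an x-predicting model."""
    z = list(eps)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x_pred = predict_x(z, t)      # the model outputs a clean-image estimate
        v = x_to_v(x_pred, z, t)      # converted to velocity only for the update
        z = euler_step(z, v, dt)
    return z


if __name__ == "__main__":
    clean = [0.5, -0.25, 1.0, 0.0]             # toy "image" as a flat list
    noise = [random.gauss(0.0, 1.0) for _ in clean]
    # An oracle that always returns the true clean image stands in for the network.
    print(sample(lambda z, t: clean, noise, num_steps=10))
```

With an oracle predictor the sampler lands back on the clean input, which shows that the update rule itself is unchanged; only the quantity the network outputs differs.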
According to the paper's claims and experiments, ϵ- and v-prediction tend to become unstable with large patches (higher per-token dimensionality), whereas x-prediction remains more stable, which allows a plain ViT to work.
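To make "increased token dimensionality" concrete, the sketch below computes how many raw pixel values each token carries for a given patch size. The helper name and the image and patch sizes are illustrative assumptions, not settings taken from this note.

```python
def patch_token_stats(image_size, patch_size, channels=3):
    """Return token count and per-token dimensionality for square, non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    tokens_per_side = image_size // patch_size
    return {
        "num_tokens": tokens_per_side ** 2,
        "token_dim": patch_size * patch_size * channels,
    }


if __name__ == "__main__":
    for image_size, patch_size in [(256, 16), (512, 32)]:
        print(image_size, patch_size, patch_token_stats(image_size, patch_size))
```

Doubling both the image side and the patch side keeps the sequence length fixed but quadruples the number of values each token must represent, which is the regime the stability claim is about.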
