JiT
z_t is the noisy image (intermediate state) at time t, defined as a mixture of the clean image x and Gaussian noise ϵ. In traditional diffusion, the model directly predicts ϵ or v (noise/velocity). JiT, a plain patch-based ViT, instead directly predicts the clean image x. If needed, it converts this prediction to v to perform the same diffusion update.
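The conversion from an x-prediction to v can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes a linear interpolation schedule z_t = t·x + (1 − t)·ϵ with velocity v = x − ϵ (so t = 1 is the clean image), and the function names `make_noisy`, `x_to_v`, `euler_step`, and the clamp value `t_eps` are hypothetical.

```python
import numpy as np

def make_noisy(x, eps, t):
    """Assumed schedule: z_t = t * x + (1 - t) * eps, with t=1 meaning clean."""
    return t * x + (1.0 - t) * eps

def x_to_v(x_pred, z_t, t, t_eps=1e-5):
    """Convert a predicted clean image into a velocity v = x - eps.

    Under the assumed schedule, x - z_t = (1 - t) * (x - eps),
    so dividing by (1 - t) recovers v. The denominator is clamped
    (t_eps is an illustrative choice) to avoid division by zero near t=1.
    """
    return (x_pred - z_t) / max(1.0 - t, t_eps)

def euler_step(z_t, v, dt):
    """One Euler step of the update dz/dt = v."""
    return z_t + dt * v

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))    # stand-in for a clean image
eps = rng.standard_normal((3, 8, 8))  # Gaussian noise
t = 0.3

z_t = make_noisy(x, eps, t)
v = x_to_v(x, z_t, t)                 # use the true x as an oracle "prediction"
x_rec = euler_step(z_t, v, 1.0 - t)   # integrate all the way to t=1
```

With a perfect x-prediction, the recovered velocity equals x − ϵ and a single step of size 1 − t lands back on x, which is why the sampler can stay unchanged whichever quantity the network outputs.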
According to the paper's claims and experiments, ϵ/v-prediction tends to become unstable with large patches (which increase the dimensionality of each token), whereas x-prediction remains more stable, allowing a plain ViT to work.
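To get a feel for how quickly token dimensionality grows with patch size, here is a small arithmetic sketch. It assumes non-overlapping square patches taken directly from RGB pixels; the helper name `patch_token_dim` and the patch sizes shown are illustrative, not values taken from the paper.

```python
def patch_token_dim(patch_size, channels=3):
    """Number of raw values in one flattened patch (assumes square RGB patches)."""
    return patch_size * patch_size * channels

# Illustrative patch sizes only
dims = {p: patch_token_dim(p) for p in (4, 16, 32)}
for p, d in dims.items():
    print(f"patch {p}x{p}: {d} values per token")
```

Doubling the patch side quadruples the per-token dimensionality, so the prediction target for each token becomes much higher dimensional as patches grow.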

Seonglae Cho