Diffusion Transformer


DiT

Diffusion loss is applied to image tokens.
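To make the one-line summary above concrete, here is a minimal sketch of a per-token diffusion loss, under my own assumptions (a cosine noise schedule, noise-prediction parameterization, and a hypothetical `model(x_t, t)` DiT callable); an illustration, not the DiT paper's code.

```python
# Minimal sketch: every latent image token gets the same noise-prediction MSE loss.
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0_tokens, t):
    """x0_tokens: (batch, num_tokens, dim) clean latent tokens; t: (batch,) in [0, 1]."""
    noise = torch.randn_like(x0_tokens)
    alpha_bar = torch.cos(t * torch.pi / 2).view(-1, 1, 1) ** 2   # assumed cosine schedule
    x_t = alpha_bar.sqrt() * x0_tokens + (1 - alpha_bar).sqrt() * noise
    pred = model(x_t, t)                        # hypothetical DiT: predicts noise per token
    return F.mse_loss(pred, noise)              # averaged over all image tokens
```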
BERT/MLM is essentially a special case of a text diffusion model: adding a masking schedule and iterative denoising turns it into a complete generative language model.
BERT (especially RoBERTa) is originally an encoder model that performs MLM (masked-token recovery). Reinterpreting masking as a discrete text diffusion process, where the masking ratio plays the role of the time step, turns it into a full generative model: train with masking ratios varying from 0% to 100%, then generate by iteratively unmasking. Fine-tuning RoBERTa on WikiText with only variable masking and iterative denoising added already produces fairly natural generated sentences.
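A minimal sketch of that iterative-unmasking loop, assuming a stock `roberta-base` checkpoint and a simple linear reveal schedule (the fine-tuned variable-masking model described above would behave better); written for illustration, not taken from the referenced experiment.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")            # assumed checkpoint
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

seq_len, steps = 16, 8
ids = torch.full((1, seq_len), tok.mask_token_id)              # fully "noised" sequence
ids[0, 0], ids[0, -1] = tok.bos_token_id, tok.eos_token_id

with torch.no_grad():
    for t in range(steps):
        probs = model(input_ids=ids).logits.softmax(-1)
        conf, pred = probs.max(-1)                              # per-position confidence
        masked = ids == tok.mask_token_id
        # linear schedule: reveal (t + 1) / steps of the sequence by this step
        n_reveal = int(seq_len * (t + 1) / steps) - int((~masked).sum())
        if n_reveal > 0:
            conf = conf.masked_fill(~masked, -1.0)              # only unmask masked slots
            top = conf[0].topk(n_reveal).indices                # most confident positions
            ids[0, top] = pred[0, top]

print(tok.decode(ids[0], skip_special_tokens=True))
```

A plain MLM checkpoint is only trained at roughly 15% masking, so the early, mostly-masked steps are out of distribution; that is exactly what the variable-masking fine-tuning on WikiText addresses.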

TREAD

In a DiT, the same diffusion loss is applied to every image token. Methods that drop tokens during training to save compute leave the dropped tokens with no loss at all, which breaks the assumption that every token is gradually denoised. So why not keep the diffusion loss on all image tokens? Instead of discarding tokens, TREAD temporarily routes a subset of them around the intermediate layers (routing) and reintroduces them later, so compute is reduced while every token still receives the diffusion loss.
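A rough PyTorch sketch of this routing idea, written from the description above rather than from the official TREAD code; the block indices, keep ratio, and random token selection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RoutedBlocks(nn.Module):
    """Illustrative only: a random token subset bypasses the middle blocks during training."""
    def __init__(self, dim=384, depth=12, heads=6, route_start=2, route_end=10, keep_ratio=0.5):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])
        self.route_start, self.route_end, self.keep_ratio = route_start, route_end, keep_ratio

    def forward(self, x):                                   # x: (batch, tokens, dim)
        B, N, D = x.shape
        route = self.training
        if route:                                           # pick which tokens to route around
            perm = torch.rand(B, N, device=x.device).argsort(dim=-1)
            n_keep = int(N * self.keep_ratio)
            keep_idx, route_idx = perm[:, :n_keep], perm[:, n_keep:]
        for i, blk in enumerate(self.blocks):
            if route and i == self.route_start:
                # store routed tokens instead of discarding them, then shrink the sequence
                routed = torch.gather(x, 1, route_idx.unsqueeze(-1).expand(-1, -1, D))
                x = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
            x = blk(x)
            if route and i == self.route_end - 1:
                # reintroduce routed tokens at their original positions
                full = torch.empty(B, N, D, device=x.device, dtype=x.dtype)
                full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), x)
                full.scatter_(1, route_idx.unsqueeze(-1).expand(-1, -1, D), routed)
                x = full
        return x        # all N tokens come out, so all of them can receive the diffusion loss
```

The middle blocks only ever see `keep_ratio` of the tokens, which is where the training-compute saving comes from; at inference every block processes the full sequence.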

RAE

Traditional DiT (= Diffusion Transformer) relies on an outdated VAE encoder: a low-dimensional latent (4 channels), a complicated pipeline, and weak expressiveness. The Representation Autoencoder (RAE) replaces the VAE with a pre-trained representation encoder (DINO, MAE, SigLIP, etc.) plus a lightweight decoder trained with an L1 + GAN + LPIPS loss.
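A hedged sketch of this RAE recipe: the encoder choice, the toy one-layer decoder, the tiny discriminator, and the loss weights are my own illustrative assumptions, not the paper's setup; only the overall structure (frozen representation encoder + lightweight decoder trained with L1 + LPIPS + GAN) follows the description above.

```python
import torch
import torch.nn as nn
import timm            # assumed available, for the frozen representation encoder
import lpips           # pip install lpips

encoder = timm.create_model("vit_base_patch14_dinov2", pretrained=True).eval()
for p in encoder.parameters():
    p.requires_grad_(False)                     # encoder stays frozen

decoder = nn.Linear(768, 3 * 14 * 14)           # toy "lightweight decoder": one linear per patch
disc = nn.Sequential(                           # tiny patch discriminator for the GAN term
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2), nn.Conv2d(64, 1, 4)
)
perceptual = lpips.LPIPS(net="vgg")

def decode(feats, grid=37, patch=14):           # DINOv2 ViT-B/14 at 518 px -> 37x37 patch grid
    B, N, D = feats.shape
    p = decoder(feats).view(B, grid, grid, 3, patch, patch)
    return p.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, grid * patch, grid * patch)

def decoder_loss(img):                          # img: (B, 3, 518, 518), scaled to [-1, 1]
    with torch.no_grad():
        feats = encoder.forward_features(img)[:, 1:]    # patch tokens, CLS dropped
    recon = decode(feats)
    l1 = (recon - img).abs().mean()
    lp = perceptual(recon, img).mean()
    adv = -disc(recon).mean()                   # generator side of a hinge-style GAN loss
    return l1 + lp + 0.1 * adv, recon           # 0.1 adversarial weight is an arbitrary choice
```

The diffusion transformer then operates directly on the frozen encoder's patch tokens instead of VAE latents.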