Diffusion Transformer

Creator
Seonglae Cho
Created
2024 Feb 28 13:2
Edited
2025 Nov 26 13:54

DiT

BERT/MLM is essentially a special case of a text diffusion model: add a masking schedule and iterative denoising, and it becomes a complete generative language model.
BERT (especially RoBERTa) is originally an encoder model that performs MLM (masked-token recovery). Reinterpreted as a discrete text diffusion process in which the masking ratio plays the role of the time step, it becomes a full generative model: train with masking ratios varying from 0% to 100%, then generate by iteratively unmasking. Fine-tuning RoBERTa on WikiText with only variable masking and iterative denoising added already yields quite natural generated sentences.
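The generation loop above can be sketched as follows. This is a toy illustration, not RoBERTa itself: `toy_denoiser` is a hypothetical stand-in for an MLM that fills every masked position, and the linear re-masking schedule is one simple choice among many.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for an MLM (e.g. RoBERTa): fills every masked
    position with a prediction. Hypothetical toy model."""
    vocab = ["the", "cat", "sat", "on", "mat"]
    return [random.choice(vocab) if t == MASK else t for t in tokens]

def diffusion_generate(length=5, steps=5, seed=0):
    """Start fully masked (100% noise) and iteratively denoise:
    each step fills all masks, then re-masks a shrinking fraction,
    mimicking a masking schedule going from 100% down to 0%."""
    random.seed(seed)
    tokens = [MASK] * length
    for step in range(steps, 0, -1):
        filled = toy_denoiser(tokens)
        remask_ratio = (step - 1) / steps  # linear schedule
        idx = set(random.sample(range(length), int(length * remask_ratio)))
        tokens = [MASK if i in idx else t for i, t in enumerate(filled)]
    return tokens

out = diffusion_generate()
print(out)  # a fully unmasked token sequence
```

With a trained MLM in place of `toy_denoiser`, the same loop performs the iterative denoising described above.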

Omni

In omni (multimodal) transformers, a diffusion loss is applied to the image tokens.
Traditional DiT (Diffusion Transformer) relies on outdated VAE encoders: low-dimensional latents (4 channels), a complex structure, and weak expressiveness. Replacing the VAE with a pre-trained representation encoder (DINO, MAE, SigLIP, etc.) plus a lightweight decoder gives a Representation Autoencoder (RAE), trained with an L1 + GAN + LPIPS loss.
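The combined decoder objective can be sketched as below. This is a minimal illustration with hypothetical stand-ins: `lpips_stub` and `gan_stub` replace a real LPIPS network and discriminator, and the weights `w_lpips`, `w_gan` are assumed hyperparameters, not values from the RAE paper.

```python
def l1_loss(xs, ys):
    # pixel-wise L1 reconstruction term
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def lpips_stub(xs, ys):
    # stand-in for a perceptual (LPIPS) distance; a real setup
    # compares deep network features, this toy compares squares
    return sum((a * a - b * b) ** 2 for a, b in zip(xs, ys)) / len(xs)

def gan_stub(xs):
    # stand-in for an adversarial realism penalty from a discriminator
    return sum(1.0 - min(abs(a), 1.0) for a in xs) / len(xs)

def rae_decoder_loss(recon, target, w_lpips=1.0, w_gan=0.1):
    # RAE keeps the representation encoder frozen and trains only a
    # lightweight decoder with an L1 + LPIPS + GAN objective
    return (l1_loss(recon, target)
            + w_lpips * lpips_stub(recon, target)
            + w_gan * gan_stub(recon))

recon, target = [0.9, 0.1, 0.5], [1.0, 0.0, 0.5]
loss = rae_decoder_loss(recon, target)
print(loss)
```

The pixel-wise L1 term anchors reconstructions, while the perceptual and adversarial terms push the lightweight decoder toward sharper, more realistic outputs.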
