Diffusion Transformer


DiT

Omni

Omni-modal transformers apply a diffusion loss to the image tokens, while text tokens keep the usual autoregressive loss (sketched below).
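
A minimal sketch of this mixed objective, assuming a Transfusion-style setup where one transformer processes an interleaved text/image sequence; all names and shapes below are illustrative assumptions, not a specific model's API:

```python
import torch
import torch.nn.functional as F

def omni_loss(logits, noise_pred, text_targets, noise, image_mask):
    """Combined objective for an omni-modal transformer (sketch).

    logits:       (B, T, V) next-token logits at text positions
    noise_pred:   (B, T, D) predicted noise at image positions
    text_targets: (B, T)    text token ids (unused where image_mask is True)
    noise:        (B, T, D) Gaussian noise added to the image latents
    image_mask:   (B, T)    True at image-token positions
    """
    # Autoregressive cross-entropy on text positions only
    ce = F.cross_entropy(logits[~image_mask], text_targets[~image_mask])
    # Diffusion (noise-prediction) MSE on image positions only
    mse = F.mse_loss(noise_pred[image_mask], noise[image_mask])
    return ce + mse  # the relative weighting is a design choice
```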
Traditional DiT (Diffusion Transformer) relies on outdated VAE encoders → low-dimensional latents (4 channels), complex structure, weak expressiveness. Instead of a VAE, combine a pre-trained representation encoder (DINO, MAE, SigLIP, etc.) with a lightweight decoder trained with an L1 + GAN + LPIPS loss = Representation Autoencoder (RAE); see the sketch below.
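
A hedged sketch of the RAE decoder objective. The encoder, decoder, and discriminator below are tiny placeholders (a real setup would freeze a pre-trained DINO/MAE/SigLIP encoder and train a lightweight ViT-style decoder); the loss weights are illustrative assumptions, and only the `lpips` package is a real library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import lpips  # pip install lpips

# Placeholder modules (hypothetical stand-ins for the real components)
encoder = nn.Conv2d(3, 768, kernel_size=16, stride=16)           # frozen representation encoder
decoder = nn.ConvTranspose2d(768, 3, kernel_size=16, stride=16)  # lightweight trainable decoder
disc = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
                     nn.Conv2d(64, 1, 4, 2, 1))                  # patch discriminator
perceptual = lpips.LPIPS(net='vgg')  # LPIPS perceptual metric

def rae_decoder_loss(x, lambda_lpips=1.0, lambda_gan=0.1):
    """L1 + LPIPS + adversarial reconstruction loss for the RAE decoder."""
    with torch.no_grad():
        z = encoder(x)              # frozen encoder: high-dimensional latent
    x_hat = torch.tanh(decoder(z))  # reconstruction in [-1, 1]
    l1 = F.l1_loss(x_hat, x)                   # pixel-space L1 term
    perc = perceptual(x_hat, x).mean()         # LPIPS perceptual term
    gan = -disc(x_hat).mean()                  # generator side of a hinge GAN loss
    return l1 + lambda_lpips * perc + lambda_gan * gan

x = torch.rand(2, 3, 64, 64) * 2 - 1  # dummy batch in [-1, 1]
loss = rae_decoder_loss(x)
```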
 
 
