DiT
BERT/MLM is essentially a special case of a text diffusion model: adding a masking schedule and iterative denoising turns it into a complete generative language model.
BERT (especially RoBERTa) is originally an encoder model trained with MLM (masked-token recovery). Reinterpreted as a discrete text diffusion process in which the masking ratio plays the role of the time step, it becomes a full generative model: trained with masking ratios varying from 0% to 100%, BERT can generate text in a diffusion manner. Adding only variable masking and iterative denoising to RoBERTa and fine-tuning on WikiText was enough to produce fairly natural sentence generation.
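A minimal sketch of the iterative-denoising generation loop, assuming a Hugging Face masked-LM checkpoint. The `roberta-base` name is a placeholder for the variable-mask fine-tuned model described above, and the confidence-based remasking with a linear schedule is one common choice, not necessarily the exact procedure used.

```python
# Sketch: diffusion-style generation with a masked LM. Start fully masked,
# then repeatedly predict all positions and remask the least-confident ones,
# lowering the masking ratio each step (linear schedule assumed).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "roberta-base"  # placeholder; should be the variable-mask fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

@torch.no_grad()
def generate(length: int = 32, steps: int = 8) -> str:
    ids = torch.full((1, length), tok.mask_token_id, dtype=torch.long)  # 100% masked
    for step in range(steps):
        logits = model(input_ids=ids).logits
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                 # per-position confidence
        still_masked = ids.eq(tok.mask_token_id)
        # commit predictions at the currently masked positions
        ids = torch.where(still_masked, pred, ids)
        # remask the least-confident positions so the mask ratio follows the schedule
        n_remask = int(length * (1 - (step + 1) / steps))
        if n_remask > 0:
            conf = conf.masked_fill(~still_masked, float("inf"))  # keep earlier commits
            remask_idx = conf[0].argsort()[:n_remask]
            ids[0, remask_idx] = tok.mask_token_id
    return tok.decode(ids[0], skip_special_tokens=True)

print(generate())
```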
Omni
In an omni (multimodal) transformer, apply a diffusion loss to the image tokens.
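A rough sketch of how the per-modality loss split could look, under the common design where text positions get next-token cross-entropy while image-latent positions get a noise-prediction (diffusion) loss. The model interface, the mask conventions, and the linear noising schedule are assumptions for illustration, not a specific paper's recipe.

```python
# Sketch of a mixed training objective for an omni transformer:
# cross-entropy on text positions, diffusion (noise-prediction) loss on image positions.
# The model signature, mask conventions, and linear noise schedule are assumptions.
import torch
import torch.nn.functional as F

def omni_loss(model, text_ids, image_latents, text_mask, image_mask):
    """text_ids: (B, T) ids; image_latents: (B, T, D) clean latents at image positions;
    text_mask / image_mask: (B, T) float 0/1 masks marking each modality's positions."""
    B, T, D = image_latents.shape

    # forward diffusion on image latents: interpolate toward noise at a random time t
    t = torch.rand(B, 1, 1, device=image_latents.device)
    noise = torch.randn_like(image_latents)
    noisy_latents = (1 - t) * image_latents + t * noise  # simple linear schedule (assumed)

    text_logits, noise_pred = model(text_ids, noisy_latents, t)  # hypothetical interface

    # 1) next-token cross-entropy on text positions
    ce = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
        text_ids[:, 1:].reshape(-1),
        reduction="none",
    ).view(B, T - 1)
    ce = (ce * text_mask[:, 1:]).sum() / text_mask[:, 1:].sum().clamp(min=1.0)

    # 2) diffusion loss: predict the added noise at image positions
    mse = (noise_pred - noise).pow(2).mean(dim=-1)
    mse = (mse * image_mask).sum() / image_mask.sum().clamp(min=1.0)

    return ce + mse
```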
Traditional DiT (Diffusion Transformer) relies on dated VAE encoders → low-dimensional latents (4 channels), complex structure, weak expressiveness. Replacing the VAE with a pre-trained representation encoder (DINO, MAE, SigLIP, etc.) plus a lightweight decoder gives a Representation Autoencoder (RAE); the decoder is trained with an L1 + GAN + LPIPS loss.
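A hedged sketch of what the RAE decoder training step could look like: the representation encoder stays frozen, only the lightweight decoder (and a discriminator) are trained, and the reconstruction objective combines L1, LPIPS, and an adversarial term. The module names and loss weights are placeholders; the `lpips` package is used only as one possible perceptual-loss implementation.

```python
# Sketch: training only a lightweight decoder on top of a frozen representation
# encoder (DINO / MAE / SigLIP-style) with L1 + LPIPS + GAN reconstruction losses.
# encoder/decoder/discriminator are placeholder modules; loss weights are illustrative.
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; expects images roughly in [-1, 1]

perceptual = lpips.LPIPS(net="vgg").eval()

def rae_decoder_loss(encoder, decoder, discriminator, images,
                     w_lpips: float = 1.0, w_gan: float = 0.1) -> torch.Tensor:
    with torch.no_grad():              # the pre-trained encoder stays frozen
        z = encoder(images)            # high-dimensional representation latents
    recon = decoder(z)                 # lightweight decoder maps latents back to pixels

    l1 = (recon - images).abs().mean()              # pixel reconstruction
    lp = perceptual(recon, images).mean()           # LPIPS perceptual distance
    gan = F.softplus(-discriminator(recon)).mean()  # non-saturating generator loss

    return l1 + w_lpips * lp + w_gan * gan
```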

