Autoregressive-to-Diffusion

Creator: Seonglae Cho
Created: 2025 Oct 15 23:29
Edited: 2025 Oct 15 23:49
Refs: Runway ML

A2D-VL

A model that converts an autoregressive VLM into a diffusion approach, enabling parallel generation.
A fixed block size and a fixed noise level are the main causes of degraded training quality, so annealing is applied as a scheduling technique during training, while inference runs with fixed settings. Token parallelism and the number of decoding steps are inversely related.
Decoding in smaller blocks of tokens improves quality and generalization to arbitrary-length responses. The diffusion unit is a block (8 tokens) and the attention unit is a token.
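A minimal sketch of block-wise parallel decoding and the parallelism/steps trade-off, assuming a hypothetical `denoise_block` callable and `MASK_ID` placeholder (not A2D-VL's actual API):

```python
MASK_ID = 0  # hypothetical mask-token id; a real tokenizer defines its own

def decode_blockwise(denoise_block, prompt_tokens, response_length, block_size=8):
    """Decode a response block by block: every position in a block starts from the
    same masked/noised state and is predicted in one parallel call, so the number
    of denoising calls scales as response_length / block_size."""
    tokens = list(prompt_tokens)
    for start in range(0, response_length, block_size):
        width = min(block_size, response_length - start)
        noisy_block = [MASK_ID] * width                    # whole block shares one noise state
        tokens.extend(denoise_block(tokens, noisy_block))  # parallel prediction of the block
    return tokens
```

With block_size=8, a 64-token response takes 8 denoising calls instead of 64 autoregressive steps; doubling the block size halves the step count, which is the inverse relationship noted above.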
Vision Transformer encoder + Transformer Model Decoder using block-causal attention to predict one block (e.g., 8 tokens) in parallel. All 8 tokens in the block share the same diffusion noise state, enabling parallel prediction.
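A minimal sketch of what such a block-causal attention mask could look like (PyTorch; the sequence length and block size are illustrative, not taken from the A2D-VL code):

```python
import torch

def block_causal_mask(seq_len: int, block_size: int = 8) -> torch.Tensor:
    """Boolean mask where True = attention allowed: each token attends bidirectionally
    within its own block and causally to every token in earlier blocks."""
    block_ids = torch.arange(seq_len) // block_size   # block index of each position
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

# Example: positions 0-7 all see each other; positions 8-15 additionally see 0-7, and so on.
mask = block_causal_mask(seq_len=24, block_size=8)
```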
  • Block size annealing: Gradually expanding the size of token blocks predicted at once.
  • Noise level annealing: Progressively increasing masking difficulty from easy positions (left) to difficult positions (right). Both schedules are sketched after this list.
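A minimal training-curriculum sketch of the two annealing schedules; the linear ramps, the block-size ladder, and the 10,000-step horizon are illustrative assumptions, not the paper's actual settings:

```python
def block_size_at(step: int, total_steps: int, sizes=(1, 2, 4, 8)) -> int:
    """Block size annealing: widen the block predicted in parallel, from
    near-autoregressive (1 token) up to the final size (assumed linear ramp)."""
    idx = min(int(len(sizes) * step / total_steps), len(sizes) - 1)
    return sizes[idx]

def masked_positions(block_size: int, step: int, total_steps: int) -> list[bool]:
    """Noise level annealing (one plausible reading): masking starts at the easy
    left-hand positions of the block and extends rightward toward the harder
    positions as training progresses."""
    progress = min(step / total_steps, 1.0)
    n_masked = max(1, round(progress * block_size))   # grows from 1 to block_size
    return [i < n_masked for i in range(block_size)]

# Halfway through an assumed 10,000-step run: block size 4, left half of an 8-token block masked.
print(block_size_at(5_000, 10_000), masked_positions(8, 5_000, 10_000))
```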