A2D-VL
A method that converts an autoregressive VLM into a diffusion model, enabling parallel token generation.
A fixed block size and a fixed noise level are the main causes of degraded training quality, so annealing is applied as a scheduling technique during training, while inference runs with fixed settings. Token parallelism and the number of decoding steps are inversely related: for example, decoding 64 tokens in blocks of 8 takes 8 steps instead of 64 autoregressive steps.
Decoding in smaller blocks of tokens improves quality and generalization to arbitrary-length responses. The diffusion unit is a block (e.g., 8 tokens), while the attention unit is a token. The architecture pairs a Vision Transformer encoder with a Transformer decoder that uses block-causal attention to predict one block in parallel; all tokens within a block share the same diffusion noise state, which is what enables parallel prediction.
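A minimal sketch of the block-causal attention pattern described above (the function name and NumPy-based masking are my own illustration, not the paper's implementation): each token attends to every token in its own block and all earlier blocks, so the 8 tokens of a block can be predicted in parallel while later blocks stay hidden.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Build a block-causal attention mask (True = attention allowed).

    Tokens attend to all tokens in their own block and in every
    earlier block, but never to later blocks. With block_size=1 this
    reduces to the standard causal (lower-triangular) mask.
    """
    blocks = np.arange(seq_len) // block_size  # block index of each token
    # query i may attend to key j iff block(j) <= block(i)
    return blocks[None, :] <= blocks[:, None]

mask = block_causal_mask(seq_len=16, block_size=8)
# tokens 0..7 attend within block 0; tokens 8..15 see blocks 0 and 1
```

With `block_size=1` this degenerates to ordinary autoregressive (causal) attention, which is why the conversion from an AR model is natural.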
- Block size annealing: gradually expanding the number of tokens predicted in parallel per block over the course of training.
- Noise level annealing: progressively increasing masking difficulty, moving from easy positions (left, close to context) to difficult positions (right, far from context).
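The two annealing schedules above can be sketched as follows. This is a hypothetical linear schedule for illustration; the function names, the linear ramp, and the left-to-right difficulty mapping are my assumptions, not the paper's exact formulas.

```python
def block_size_schedule(step: int, total_steps: int,
                        start: int = 1, end: int = 8) -> int:
    """Block size annealing: linearly grow the number of tokens
    predicted in parallel, from `start` (autoregressive) to `end`."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(start + frac * (end - start))

def mask_difficulty(position: int, block_size: int,
                    anneal_frac: float) -> float:
    """Noise level annealing (illustrative): masking difficulty for a
    position inside the current block. Early in training
    (anneal_frac≈0) only rightmost positions are hard; as training
    progresses the difficulty spreads leftward until all positions
    are fully masked."""
    rel = position / max(block_size - 1, 1)  # 0 = leftmost, 1 = rightmost
    return min(1.0, anneal_frac + rel * (1.0 - anneal_frac))

# training starts near autoregressive decoding ...
assert block_size_schedule(step=0, total_steps=1000) == 1
# ... and ends at the full 8-token diffusion block
assert block_size_schedule(step=1000, total_steps=1000) == 8
```

At inference the schedule is frozen at its final values (fixed block size and noise settings), matching the training/inference split described above.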

Seonglae Cho