A2D-VL
A method that converts an autoregressive VLM into a diffusion model, enabling parallel token generation.
A fixed block size and a fixed noise level are the main causes of degraded training quality, so annealing is applied as a scheduling technique during training, while inference runs with fixed settings. Token parallelism and the number of decoding steps are inversely related: for example, decoding 64 tokens in blocks of 8 takes 8 steps instead of 64 autoregressive steps.
Decoding in smaller blocks of tokens improves quality and generalization to arbitrary-length responses. The diffusion unit is a block (e.g., 8 tokens), while the attention unit is a token. The architecture pairs a Vision Transformer encoder with a Transformer decoder that uses block-causal attention to predict one block in parallel; all tokens within a block share the same diffusion noise state, which is what enables parallel prediction.
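A minimal sketch of the block-causal attention pattern described above (the function name and NumPy-based masking are my own illustration, not the paper's implementation): each token attends to every token in its own block and all earlier blocks, so the 8 tokens of a block can be predicted in parallel while later blocks stay hidden.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Build a block-causal attention mask (True = attention allowed).

    Tokens attend to all tokens in their own block and in every
    earlier block, but never to later blocks. With block_size=1 this
    reduces to the standard causal (lower-triangular) mask.
    """
    blocks = np.arange(seq_len) // block_size  # block index of each token
    # query i may attend to key j iff block(j) <= block(i)
    return blocks[None, :] <= blocks[:, None]

mask = block_causal_mask(seq_len=16, block_size=8)
# tokens 0..7 attend within block 0; tokens 8..15 see blocks 0 and 1
```

With `block_size=1` this degenerates to ordinary autoregressive (causal) attention, which is why the conversion from an AR model is natural.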
- Block size annealing: gradually expanding the number of tokens predicted in parallel per block over the course of training.
- Noise level annealing: progressively increasing masking difficulty, moving from easy positions (left, close to context) to difficult positions (right, far from context).
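The two annealing schedules above can be sketched as follows. This is a hypothetical linear schedule for illustration; the function names, the linear ramp, and the left-to-right difficulty mapping are my assumptions, not the paper's exact formulas.

```python
def block_size_schedule(step: int, total_steps: int,
                        start: int = 1, end: int = 8) -> int:
    """Block size annealing: linearly grow the number of tokens
    predicted in parallel, from `start` (autoregressive) to `end`."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(start + frac * (end - start))

def mask_difficulty(position: int, block_size: int,
                    anneal_frac: float) -> float:
    """Noise level annealing (illustrative): masking difficulty for a
    position inside the current block. Early in training
    (anneal_frac≈0) only rightmost positions are hard; as training
    progresses the difficulty spreads leftward until all positions
    are fully masked."""
    rel = position / max(block_size - 1, 1)  # 0 = leftmost, 1 = rightmost
    return min(1.0, anneal_frac + rel * (1.0 - anneal_frac))

# training starts near autoregressive decoding ...
assert block_size_schedule(step=0, total_steps=1000) == 1
# ... and ends at the full 8-token diffusion block
assert block_size_schedule(step=1000, total_steps=1000) == 8
```

At inference the schedule is frozen at its final values (fixed block size and noise settings), matching the training/inference split described above.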

Seonglae Cho