Think in Diffusion, Talk in Autoregression
A hybrid LLM architecture combining Diffusion and Autoregressive (AR) decoding. In a single forward pass, it performs parallel draft generation via Diffusion (Thinking) and final token sampling via AR (Talking).
It leverages the GPU's free token slots to maximize parallelism while maintaining AR-level quality. A single model performs draft generation and verification together in one forward pass, keeping serving overhead low. The result is a 4.7×–5.9× tokens/sec speedup over AR with nearly identical quality, making it superior in both efficiency and quality to existing diffusion and speculative decoding approaches.
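
A minimal sketch of the single-pass draft-and-verify loop, assuming greedy acceptance in the speculative-decoding style; `model`, `prefix`, and `draft` are hypothetical stand-ins, and the real architecture also produces the next diffusion draft in the same pass, which is omitted here:

```python
import torch

def decode_step(model, prefix, draft):
    """Verify last step's diffusion draft with AR logits in one forward pass.

    prefix: (1, t) accepted tokens; draft: (1, k) diffusion-drafted tokens.
    model: hypothetical causal LM returning logits of shape (1, seq, vocab).
    """
    t, k = prefix.shape[1], draft.shape[1]
    seq = torch.cat([prefix, draft], dim=1)
    logits = model(seq)                                # one pass over prefix + draft
    # AR predictions for positions t .. t+k (logits at i predict token i+1)
    ar_pred = logits[:, t - 1 : t + k, :].argmax(-1)
    # accept the longest leading run of the draft that matches the AR predictions
    matches = (ar_pred[:, :k] == draft).squeeze(0).long()
    n_accept = int(matches.cumprod(0).sum())
    # the AR prediction right after the accepted run is always a valid next
    # token, so every step commits at least one token
    return torch.cat([prefix, draft[:, :n_accept],
                      ar_pred[:, n_accept : n_accept + 1]], dim=1)
```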

Attention Mask
The GPU's free token slots are what make this possible. The bottleneck in AR decoding is not computation but memory load: each step is dominated by streaming the model weights from GPU memory. As a result, adding a few more token positions to the same forward pass adds almost no latency. These are free token slots: additional token positions that can be filled essentially for free.
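
A rough timing sketch of this memory-bound behavior, assuming a CUDA GPU is available: a large matmul stands in for one transformer layer whose latency is dominated by loading the weight matrix, so a forward pass over 8 tokens costs about the same as over 1 (the sizes and iteration count are arbitrary):

```python
import time
import torch

# stand-in for one transformer layer: per-step latency is dominated
# by streaming the weight matrix W from GPU memory
d = 8192
W = torch.randn(d, d, device="cuda", dtype=torch.float16)

def step_latency(n_tokens, iters=100):
    x = torch.randn(n_tokens, d, device="cuda", dtype=torch.float16)
    x @ W  # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        x @ W  # extra rows of x reuse the same weight load: nearly free
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"1 token:  {step_latency(1) * 1e6:.1f} us")  # memory-bound baseline
print(f"8 tokens: {step_latency(8) * 1e6:.1f} us")  # almost identical latency
```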
AR cannot use these slots: the next token must be conditioned on the previous one (causal dependency), so even when free slots exist they sit idle. A Diffusion LM, by contrast, assumes independence among masked tokens, so multiple mask tokens can be placed in the free slots and predicted in parallel in a single pass.
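
A sketch of how a hybrid attention mask could combine the two regimes (my illustration of the idea, not necessarily the paper's exact layout): the t prefix tokens keep a causal mask, while k appended mask-token slots attend to the full prefix and to each other bidirectionally, diffusion-style:

```python
import torch

def hybrid_mask(t: int, k: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for t causal prefix
    tokens followed by k diffusion draft slots. Illustrative only."""
    n = t + k
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:t, :t] = torch.tril(torch.ones(t, t, dtype=torch.bool))  # causal prefix
    mask[t:, :t] = True  # draft slots see the whole prefix
    mask[t:, t:] = True  # and each other, bidirectionally (diffusion-style)
    return mask

print(hybrid_mask(4, 3).int())
```

Such a boolean mask can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, so both attention regimes run in the same forward pass.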


Seonglae Cho