Diffusion Language Models are Super Data Learners

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh
†Correspondence to: Jinjie Ni <[email protected]>
Released on Aug 09 2025
Full Paper: link
GitHub:
 

Recent research highlights the potential of diffusion language models (DLMs). Owing to the parallel decoding design, they can generate thousands of tokens per second, resulting in exceptionally low latency for real-world applications [17][18][19]. Moreover, several recent DLMs have demonstrated performance on par with autoregressive (AR) models [8][9].
But is speed their only advantage? After rigorous investigation over the past few months, we discovered a more striking trait: diffusion models are super data learners under fixed data budgets. That is, given the same number of unique pre-training tokens, diffusion models consistently outperform AR counterparts of equal size by trading additional FLOPs for improved learning. This amounts to more than 3x the data potential of AR models.
Such data potential is increasingly valuable as we approach the limits of available pre-training data [20], especially given that AR models show diminishing returns after just four epochs of data reuse [11]. Coincidentally, a concurrent study [1] explores similar topics. However, our careful analysis reveals several methodological issues in [1] that may lead to flawed conclusions.
In this post, we present preliminary results providing strong evidence for a clear “crossover” point where diffusion models outperform AR models. We then delve into the learning behavior of diffusion models to shed light on how this advantage emerges. Finally, we offer a detailed critique of the problematic methodologies in [1], aiming to guide more robust future research.
 

Highlights

  • We pre-trained DLMs and AR models from scratch for up to 8B parameters and 480B tokens. DLMs demonstrate > 3x greater data potential compared to autoregressive (AR) models. Notably, a 1B-parameter masked diffusion model achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. Note that more repetitions could further improve its performance, as no signs of diminishing returns were observed.
  • DLMs are super-dense models that consume more FLOPs than dense AR models. Training DLMs to fully leverage the data typically demands at least two orders of magnitude more FLOPs. During inference, generating sequences of 16 to 4096 tokens incurs a 16x to 4700x increase in FLOPs compared to AR baselines. In addition, the diffusion objective enables more expressive bidirectional attention, allowing bidirectional modeling of language data, which is not fully causal, to squeeze out more of its value.
  • Our concurrent work, “Diffusion Beats Autoregressive in Data-Constrained Settings” [1], contains methodological issues, including a problematic diffusion loss formulation, invalid metrics for comparison, unfair settings for AR models, and a problematic scaling law formulation, all of which might lead to misleading results and conclusions.
 
Table of Contents

1. Preliminary Results

Section Highlights
  1. We pre-trained DLMs and AR models from scratch at up to 8B parameters and 480B tokens. Under unique-data constraints, DLMs clearly outperform their AR counterparts at some point as the data is repeated, demonstrating >3x the data potential of autoregressive (AR) models. Notably, the crossover points on different evals are similar.
  2. A 1B-parameter masked diffusion model achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. More repetitions could further improve its performance, as no signs of diminishing returns were observed.
  3. Models that “overfit” on the validation set often keep improving on downstream tasks. After overfitting, absolute NLL rises due to overconfidence, but △NLL continues to widen, indicating preserved or even enhanced discriminative ability.
  4. Though robust to data repetition, DLMs also overfit when trained for enough epochs. Larger unique data sizes delay overfitting, while larger models accelerate its onset.

1.1 The Intelligence Crossover

Figure A: The performance comparison of autoregressive (AR) and masked diffusion models (Diffusion) when repeating a limited portion of data. All models are trained on 96B total tokens (including repetition), varying the unique tokens from 0.5B to 96B. Diffusion models exploit the data better through more repetition of limited unique data. Runs with more unique tokens require more repetition to reach the crossover; the high-unique-token runs postpone the crossover beyond our 96B-token observation scope.
 
Overall Setup: Dense 1B/8B models trained on a fixed 96B-token budget, varying unique tokens from 0.5B to 96B. A 1B DLM was also trained for 480 epochs on 1B unique tokens.
Figure A presents an extensive set of results, providing compelling evidence that, by repeating on normal web data, masked DLMs outperform AR counterparts across model sizes in data-constrained settings, demonstrating significantly greater potential without encountering performance saturation.
Overall, our results suggest DLMs exhibit more than threefold greater ultimate data potential compared to autoregressive models. This estimate is empirically supported within our experiments, as DLMs trained on only 0.5B unique tokens (not yet converged) achieve performance comparable to AR models trained on 1.5B unique tokens (converged). Increasing the model size from 1B to 8B further unleashes the data potential, while AR does not benefit from a larger model size under data constraints. DLMs also show negligible performance degradation when the unique data is drastically reduced from 96B to 0.5B tokens.
Under compute-bound scenarios—where data supply is abundant—AR models fit the training data better and thus achieve superior end-of-training performance. However, under data-bound conditions—reflecting the current reality of rapidly increasing compute power outpacing data availability—diffusion models significantly surpass AR models at some point. A deeper analysis of this phenomenon is presented in Section 2.
Notably, the crossover points on different evals are similar (only two general-domain evals in this case). As we increase the number of unique tokens, the crossover point where DLMs overtake AR models is postponed (0.5-1.5B) or shifted beyond our observable range (1.5-96B). This postponement is observed more clearly in Section 1.3. The small gap between the 10B and 96B runs arises because their crossover points are pushed far beyond our 96B-token window, so what we observe there is only an initial pattern, and initial differences are typically weaker than later ones.
Figure B. The 1B-parameter DLM—trained solely on the original 1B pre-training tokens for 480 epochs—achieves ~56% accuracy on HellaSwag and ~33% on MMLU.
To study the full potential of tokens in DLM training, we launched an additional run in which the same 1B-token dataset was repeated for 480 epochs, yielding a total of 480B training tokens. Notably, it achieves ~56% accuracy on HellaSwag and ~33% on MMLU, significantly outperforming AR’s ~41% and ~29%, respectively. Surprisingly, even under such extreme repetition, performance did not saturate, suggesting that DLMs can extract substantially more signal from a fixed 1B-token corpus.

1.2 High Validation Loss ≠ Degraded Intelligence

In this section, we demonstrate why downstream evaluation results are more critical than validation loss when comparing diffusion and AR models, and why we need to present a large set of benchmark data points spanning the whole observation scope instead of showing only a single data point.
Figure C: When models “overfit” on pre-training validation sets, their performance on downstream evaluations does not necessarily drop, and may keep improving until the end of training.
We observe that the autoregressive models exhibiting signs of "overfitting"—indicated by an increase in validation loss—continue to improve on downstream tasks, as illustrated in Figures A and B. This phenomenon arises because validation loss is measured as an absolute Cross-Entropy loss (Negative Log-Likelihood, NLL), whereas accuracy on multiple-choice benchmarks depends on comparing the relative Cross-Entropy losses across options. Consequently, changes in absolute NLL values do not necessarily translate into changes in their relative ordering.
 
Figure D: An illustration of why the models’ performance keeps growing after they overfit on pre-training validation sets (indicated with dashed line). NLL: Negative log-likelihood on the ground-truth and other options of multiple-choice evals (NLLs on other options are averaged). △NLL: The differences between the NLLs on ground-truth and other options, which keeps growing. This is a 1B autoregressive model trained on 1.5B unique tokens, 64 epochs, on both out-of-domain and in-domain pre-training data.
In Figure D, we visualize the average negative log-likelihood (NLL) for the ground-truth and alternative options across multiple-choice evaluations, along with their respective differences (△NLL), during the pre-training of a 1B-parameter autoregressive model over 1.5B unique tokens for 64 epochs. Notably, even at the first validation checkpoint (after 3,600 training steps), the model already exhibits substantially lower NLL (higher likelihood) on the ground-truth options, indicating an early capacity to preferentially assign higher logits to correct choices. As training continues, the model begins to overfit, causing an increase in NLL values for both ground-truth and incorrect options. Interestingly, even after this "overfitting," the gap between ground-truth and alternative NLLs continues to widen consistently, indicating that the model's underlying discriminative ability continues to improve despite the rise in validation loss. This phenomenon persists for both in-domain and out-of-domain training data.
One plausible explanation is that repeated exposure to a limited set of training data causes the model to become excessively confident on certain text segments, amplifying NLL values for incorrect predictions. Nevertheless, the persistent growth in relative NLL differences between ground-truth and other options reflects continuous improvement in the model’s discriminative power. A similar rationale applies to generative evaluations, where choices are made at the token rather than sentence level, and we hypothesize that being mistakenly overconfident on non-essential tokens has limited impact on the overall task. This hypothesis will be further investigated in our forthcoming study with larger-scale models trained on larger unique data, as models trained under small computational budgets typically fail to show a smooth trend in generation-based evaluations.
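To make this concrete, below is a minimal sketch of how multiple-choice benchmarks are scored from per-option NLLs (assuming a generic Hugging Face-style causal LM and tokenizer; names like `option_nll` and `evaluate_mc` are ours for illustration, not from any particular evaluation harness). Accuracy depends only on which option has the lowest NLL, so all NLLs can rise after "overfitting" while accuracy and △NLL keep improving.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_nll(model, tokenizer, context: str, option: str) -> float:
    """Sum of per-token NLLs of `option` given `context` under a causal LM."""
    # Assumes tokenizing the concatenation keeps the context tokens as a prefix.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    opt_start = ctx_ids.shape[1]
    tgt = full_ids[0, opt_start:]          # option tokens to be scored
    pred = logits[0, opt_start - 1:-1]     # logits at position i predict token i+1
    return F.cross_entropy(pred, tgt, reduction="sum").item()

def evaluate_mc(model, tokenizer, question: str, options: list[str], answer_idx: int):
    nlls = [option_nll(model, tokenizer, question, o) for o in options]
    # Accuracy only cares about the argmin, i.e., the *relative* NLLs.
    correct = int(min(range(len(nlls)), key=nlls.__getitem__) == answer_idx)
    # Delta-NLL: gap between the ground truth and the average of the other options.
    others = [n for i, n in enumerate(nlls) if i != answer_idx]
    delta_nll = sum(others) / len(others) - nlls[answer_idx]
    return correct, delta_nll
```

Even if every entry of `nlls` rises after "overfitting", the argmin (and hence accuracy) and `delta_nll` can keep improving, which is exactly the pattern in Figure D.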

1.3 Diffusion Language Models also Overfit the Data

Figure E: Validation loss curves for models trained with various sizes and unique data budgets, repeating up to 1000 epochs. Diffusion language models will also overfit. The more unique data we train on, the later the model overfits; the larger the model we train, the earlier it overfits.
In Figure A, we did not observe diminishing returns, let alone overfitting, when training DLMs with extremely limited unique data (down to 0.5B tokens) over a large number of epochs (up to 480 epochs). This leads us to investigate: Do DLMs eventually overfit given sufficient training?
We trained models across various sizes and unique data budgets, extending training up to 1000 epochs. As illustrated in Figure E, when the unique data is sufficiently small and the model size sufficiently large, overfitting eventually emerges after prolonged training.
Specifically, we observe that the epoch at which a model begins to overfit correlates positively with the unique data size and negatively with the model size. In other words, larger unique data sizes delay overfitting, while larger models accelerate its onset.
It is important to note that validation loss overfitting does not immediately imply a decline in model capability—actual performance degradation typically occurs much later (e.g., as seen in Figure A at 0.5B tokens and 192 epochs).
 
Key Experimental Settings
It is worth noting that the hyperparameters adopted in our experiments are primarily optimized for AR models, reflecting extensive prior tuning by the broader LLM research community. Although we aimed to maintain identical settings across AR and diffusion models, this is inherently unfair to diffusion models. Consequently, the observed performance advantages of diffusion models could be underestimated.
All training runs were conducted using a significantly modified Megatron-LM codebase. Crossover experiments were trained on a subset of the Nemotron-CC corpus [2], while all other experiments used a subset of the c4-en corpus [3]. Note that all token budgets were randomly sampled from these corpora, without any special processing. We used the same masked diffusion objective as in [8], detailed in Equation 2. Specifically, we employed a batch size of 256, a sequence length of 2048, and a warmup-stable-decay (WSD) learning rate schedule peaking at 2e-4 with 1000 warmup steps, followed by an exponential decay to 2e-5 over the final 10% of training. Model parameters were randomly initialized from a normal distribution with standard deviation 0.02. We adopted a performant architectural configuration, incorporating the GPT-2 tokenizer, RoPE, SwiGLU, pre-layer RMSNorm, bias-free linear layers, and QK normalization. Validation loss was evaluated on the c4-en validation set using distinct 100M-token subsets per evaluation, and benchmark evaluations strictly adhered to official protocols.
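For concreteness, here is a minimal sketch of the WSD schedule described above (linear warmup to the 2e-4 peak over 1000 steps, a constant plateau, then an exponential decay to 2e-5 over the final 10% of steps); this is our own illustrative implementation, not an excerpt from the training codebase, and the exact decay curve used in our runs may differ in detail.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 2e-4,
           final_lr: float = 2e-5, warmup_steps: int = 1000,
           decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, exponential decay."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)        # linear warmup
    if step < decay_start:
        return peak_lr                                      # stable plateau
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr * (final_lr / peak_lr) ** progress       # exponential decay to final_lr
```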

2. Diffusion Language Models are Super Data Learners

Caveat: This section mainly presents theoretical analysis based on partial evidence, without careful ablations.
Section Highlights
  1. DLMs are super data learners because (1) their bidirectional modeling, enabled by the diffusion objective and bidirectional attention, squeezes more out of web data, which is not fully causal; and (2) their computational super-density, i.e., more FLOPs per task, translates directly into greater intelligence.
  2. AR models prioritize compute efficiency over data utilization. Their transformer design, with teacher forcing and causal masking, maximizes GPU usage but limits modeling capacity. As compute becomes cheaper, data availability emerges as the key bottleneck, motivating our study of DLMs.
  3. The diffusion objective explicitly requires each data point in the pre-training dataset to be corrupted at multiple masking ratios and combinations for effective training, offering another insight into why more data repetition brings so much gain.

2.1 What is the Real Advantage of Diffusion Language Models?

In its abstract, [1] offers this interpretation:
“We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR’s fixed left-to-right factorization.”
The “data augmentation” view may not capture the core idea. Injecting noise into the data does benefit model generalizability and is common in the vision domain, but it is not a game changer: one can also inject noise into AR training, yet it does not lead to practical gains [21].
Instead, the real advantage lies in the fact that the diffusion objective unlocks a way to model the real world data bidirectionally, with the more expressive bidirectional attention. This advantage is a game changer leading to two major direct benefits:
  • Reduced inductive bias via any-order modeling. Autoregressive language models impose a strict causal inductive bias on textual data modeling, where each token prediction is conditioned solely on preceding tokens. While natural language exhibits inherent left-to-right causality from a human perspective, evidence indicates that modeling language in reverse or arbitrary order remains feasible [12]. Moreover, numerous non-causal data types, such as source code, database entries, symbolic notations, and biological sequences, frequently appear online. Thus, enforcing a purely causal inductive bias significantly restricts the model's ability to capture the rich patterns embedded in diverse textual distributions. DLMs remove this inductive bias with an objective that enables any-order modeling, fully squeezing the value of every single data point.
  • Super-density: more training- and test-time FLOPs per task. As illustrated in Figure F, diffusion models outperform autoregressive counterparts by repeatedly processing a portion of unique data during training, effectively scaling FLOPs along the temporal dimension. The continuous-time objective used by masked diffusion models is particularly advantageous, enabling fine granularity in temporal FLOPs scaling. Similarly, at inference, diffusion models iteratively refine predictions, further amplifying computational density per task. Notably, bidirectional attention implies each token is computed up to N times to generate a sequence of length N, in contrast with autoregressive models using a KV cache, which compute each token only once.
    • An analysis on the training and inference FLOPs comparison (Figure F).
      Figure F: (left) The diffusion language models are approximated to consume >100 times more FLOPs than AR counterparts to achieve their full potential in training (where the peak performance is usually much greater than AR). (middle) The theoretical inference FLOPs controlling the sampling steps to be equal to the sequence length. The total inference FLOPs have a power-law relationship with the generation sequence length for both. (right) The theoretical inference FLOPs controlling the generation sequence length, where sampling 512 steps from an AR model with KV cache ≈ sampling 1 step from the masked diffusion model.
      Figure F compares the FLOPs of autoregressive (AR) and masked diffusion models during training and inference. At training time, our preliminary experiments indicate diffusion models require at least about two orders of magnitude (>100×) more FLOPs than AR models to reach optimal performance, with the exact figure varying with model size and data budget. During inference, given a fixed number of sampling steps, masked diffusion models consume between 16× and 4700× more FLOPs per task, with this gap widening as the target sequence length increases from 16 to 4096 (Figure F, middle). Moreover, for a constant sequence length, FLOPs consumed by diffusion models scale linearly with the number of sampling steps. In theory, diffusion models can generate an N-token sequence within a single step, whereas AR models inherently require N sequential steps. However, due to the KV-cache mechanism, the computational cost for AR models generating N tokens is roughly equivalent to that of diffusion models performing a single sampling step. It is worth noting that a significant portion of diffusion model computation can be parallelized. Thus, in practice, before the GPU compute bound is reached, the inference speed gap between diffusion and AR models at the same number of sampling steps remains acceptable. Additionally, advances in GPU architectures optimized for compute-intensive workloads may further mitigate this gap in the near future.
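As a rough illustration of the inference-FLOPs gap, here is a back-of-the-envelope sketch (our own, using the common ~2 × parameters FLOPs-per-token approximation and ignoring the attention terms, so the constants differ from Figure F's exact accounting):

```python
def ar_infer_flops(n_params: float, gen_len: int) -> float:
    """AR with KV cache: each generated token costs roughly one forward pass
    over a single position (previous tokens are reused from the cache)."""
    return 2 * n_params * gen_len

def diffusion_infer_flops(n_params: float, seq_len: int, steps: int) -> float:
    """Masked diffusion: every denoising step re-encodes the full sequence
    with bidirectional attention, so cost scales with seq_len * steps."""
    return 2 * n_params * seq_len * steps

n = 1e9  # a 1B-parameter model
for length in (16, 256, 4096):
    ratio = diffusion_infer_flops(n, length, steps=length) / ar_infer_flops(n, length)
    print(f"gen_len={length:5d}  diffusion/AR FLOPs ratio ~ {ratio:.0f}x")
```

With the number of sampling steps tied to the sequence length, the ratio grows roughly linearly with length; the 16× to 4700× range in Figure F follows the same trend, with the extra factor at long lengths coming from the attention terms dropped in this sketch.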
Reflecting on LLM history, many recent intelligence leaps, such as T5 [3], GPT-3 [13], and o1 [14], are direct results of FLOPs scaling. Trust the god of FLOPs!

2.2 Autoregressive Models are Trading Data Potential for Compute Efficiency

The autoregressive (AR) modeling methodology (decoder-only transformer architecture with teacher forcing and causal masking) is a legendary local optimum in AI history. Its success can be broken down into two factors:
  • Optimal utilization of modern GPU architectures: AR achieves an exceptionally high signal-to-FLOPs ratio during training and high Model FLOPs Utilization (MFU) during batched inference. During training, each token in a batch consistently receives a gradient signal, approximately twice the expected signal of masked diffusion models with linear schedules (see the sketch after this list). Indeed, it is challenging to identify alternative methodologies surpassing AR in signal-to-FLOPs efficiency. At inference, token-by-token generation naturally facilitates throughput optimization techniques such as continuous batching, maximizing MFU. Thus, AR stands as an exceedingly robust and efficient baseline method.
  • Natural language can be causally modeled with low loss: Empirically, pre-training on web-scale corpora demonstrates that left-to-right modeling consistently attains lower loss compared to alternative sequence orders (see Figure 2 of [12]). If one must select a single sequence order for language modeling, the left-to-right order is empirically optimal (Eq. 3), as it effectively captures natural language patterns. It’s also easy to interpret this: most text data are generated by humans, and humans are RNNs. However, as previously discussed, purely left-to-right modeling inherently misses certain contextual dependencies, indicating room for improvement in data potential.
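As a quick illustration of the training-signal gap mentioned in the first bullet (our own toy calculation, not tied to any particular codebase): under a linear masking schedule, the masking ratio t is uniform on (0, 1), so in expectation only about half of the tokens in a batch are masked and receive a loss signal, versus essentially all tokens under AR teacher forcing.

```python
import random

def supervised_fraction(num_samples: int = 2_000, seq_len: int = 256) -> float:
    """Monte Carlo estimate of the fraction of tokens receiving a gradient
    signal under a linear masking schedule: sample t ~ U(0, 1), mask each
    token independently with probability t, and count masked positions."""
    masked = 0
    for _ in range(num_samples):
        t = random.random()
        masked += sum(random.random() < t for _ in range(seq_len))
    return masked / (num_samples * seq_len)

print(supervised_fraction())  # ~0.5, vs ~1.0 for AR teacher forcing
```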
Figure 2 of [12]. (a) Convergence speed with different fixed prediction orders: left-to-right, fixed random, and fixed block-wise random. (b) Impact of adding 10% left-to-right (L2R) data to AO-GPT training on its L2R and any-order loss.
Currently, there is an emerging trend in which computational resources are becoming increasingly affordable, shifting the primary constraint for scaling intelligence towards data availability. Consequently, for researchers targeting advanced intelligence, the previous emphasis on maximizing GPU utilization has diminished. The inherent modeling limitations imposed by causal masking are now becoming unacceptable. This motivates our exploration of DLMs, which intentionally sacrifice computational efficiency to achieve higher data efficiency—representing an approach diametrically opposed to autoregressive methods.
To strike a favorable balance between these two extremes, a natural strategy is interpolation, as exemplified by block diffusion methods [15]. However, achieving comparable training efficiency remains challenging: block diffusion inherently conditions each generated block on a clean context, significantly constraining training efficiency compared to the highly efficient teacher-forcing paradigm employed in autoregressive training.

2.3 The Loss Tells Us to Repeat the Data

When conducting multi-epoch training for masked DLMs, we effectively transform each unique data point into multiple noisy variants. Specifically, the loss function for masked diffusion models includes an expectation term $\mathbb{E}_{t,\, x_t \sim q_{t|0}(x_t \mid x_0)}$, placed outside the negative log-likelihood component. Here, $q_{t|0}(x_t \mid x_0)$ represents the distribution of masked sequences $x_t$ conditioned on the clean input $x_0$ at diffusion timestep $t$, as determined by the forward corruption process. Intuitively, this means averaging the loss across all possible masking configurations at each time step $t$.
In other words, the objective function explicitly requires each data point in the pre-training dataset to be corrupted at multiple masking ratios and combinations for more effective training, by estimating a more precise expectation. Thus, data repetition emerges inherently from the diffusion model's objective rather than from an arbitrary source. Open-source models, such as LLaDA, typically corrupt each data point only once, likely due to computational limitations, approximating the expectation term using a single-sample Monte Carlo estimator.
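A minimal sketch of this view in PyTorch-style code (our own illustration under the notation above; `model` is assumed to be any bidirectional transformer returning per-position logits and `mask_id` the mask token id): every pass over the same data draws a fresh t and masking pattern, i.e., another Monte Carlo sample of the expectation in Equation 2.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0: torch.Tensor, mask_id: int) -> torch.Tensor:
    """Single-sample Monte Carlo estimate of the masked diffusion objective (Eq. 2).

    x0: clean token ids, shape [batch, seq_len]. Each call samples a fresh
    timestep t and masking pattern, so every epoch over the same data adds
    another sample to the expectation over q_{t|0}.
    """
    b, L = x0.shape
    t = torch.rand(b, 1, device=x0.device).clamp(min=1e-3)     # t ~ U(0, 1]
    is_masked = torch.rand(b, L, device=x0.device) < t          # mask each token w.p. t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)

    logits = model(xt)                                           # [batch, seq_len, vocab]
    nll = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # [batch, seq_len]

    # Only masked positions contribute, reweighted by 1/t as in Eq. 2.
    loss = (nll * is_masked) / t
    return loss.sum() / (b * L)
```

Corrupting each data point only once, as in typical open-source recipes, is a single-sample estimate of this expectation; repeating the data simply averages more such samples.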

3. Major Problems in “Diffusion Beats Autoregressive in Data-Constrained Settings [1]”

Section Highlights
  1. First of all, we wish to highlight that some conclusions in [1] are valid; here we focus only on the problematic methodologies and conclusions.
  2. All experiments in [1] use a problematic loss formulation, which might lead to misleading conclusions.
  3. Validation loss is not a good metric for comparing AR and diffusion models, and [1]'s reliance on it leads to further problems:
    1. [1] uses early checkpoints for AR benchmark results, resulting in unfair comparisons.
    2. [1] compares overfitting trends between AR and diffusion models using a larger model and a smaller set of unique training tokens for AR, an unfair setup, as larger models trained on less diverse data are inherently prone to earlier overfitting.
    3. The scaling law used in [1] assumes a non-increasing validation loss, which fails in practice due to overfitting-induced loss increases. This flawed assumption leads to poor fits and biases any conclusions derived from its predictions.

    3.1 Problematic Metrics for Comparison

    a. Problematic diffusion loss formulation

    We notice that the authors modified the original draft to add a linear time-dependent reweighting in the latest arXiv submission (v3); we will nonetheless keep the assumption that all experiments used Equation 1, as the loss range in Figure 4(b) of [1] closely matches the behavior expected from Equation 1. We look forward to the release of the codebase (at the time of this post it is still an empty repo) and to replications from the community. Update on Aug 11: The authors of [1] did some experiments to verify: . Glad to see that! Update on Aug 26: The authors released the code at: . Welcome to verify!
     
    All experiments in [1] employ the following loss function without explicit justification:

    $$\mathcal{L}_1(x_0) = -\,\mathbb{E}_{t \sim U[0,1]}\,\mathbb{E}_{x_t \sim q_{t|0}(x_t \mid x_0)}\left[\sum_{i=1}^{L} \mathbf{1}\!\left[x_t^i = \mathrm{[MASK]}\right]\log p_\theta\!\left(x_0^i \mid x_t\right)\right] \tag{1}$$

    which can be interpreted as summing the cross-entropy loss over the expected masked tokens. Here $x_0^i$ denotes the $i$-th element of the sentence sample $x_0$; $q_{t|0}(x_t \mid x_0)$ is the forward process corrupting the clean data $x_0$ into a noisy $x_t$ at time step $t$, and $p_\theta$ is the predicted distribution parameterized by $\theta$.
    However, the above loss differs significantly from the theoretically grounded and widely adopted masked diffusion language modeling loss defined as follows:

    $$\mathcal{L}_2(x_0) = -\,\mathbb{E}_{t \sim U[0,1]}\,\mathbb{E}_{x_t \sim q_{t|0}(x_t \mid x_0)}\left[\frac{1}{t}\sum_{i=1}^{L} \mathbf{1}\!\left[x_t^i = \mathrm{[MASK]}\right]\log p_\theta\!\left(x_0^i \mid x_t\right)\right] \tag{2}$$

    which was theoretically derived by several previous works, i.e., SEDD [4], RADD [5], MD4 [6], and [7]. Meanwhile, it is the commonly accepted loss function in current state-of-the-art DLMs, such as LLaDA [8], Dream [9], [16], DiffuCoder [10], etc.
    Concretely, loss (1) misses the time-dependent reweighting $\frac{1}{t}$. In theory, one can prove that loss (1) does not faithfully represent the model likelihood (see H.4 of the MD4 [6] paper for a detailed discussion), while loss (2) is proven to be an upper bound of the negative log-likelihood.
    This may undermine the conclusions in [1]. Since all experiments of [1] are based on loss (1), every single conclusion might be problematic, including the crossover existence, crossover position, power-law fitting/prediction, the critical compute frontier, and even the benchmark results (the benchmark results are selected based on the best validation loss, a practice that is itself problematic and is discussed later).
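To make the difference concrete, the two objectives differ only in the per-timestep weight applied to the masked-token cross-entropy; a tiny illustration (our own) of how much that weight changes across masking ratios:

```python
# Weight on the masked-token cross-entropy at masking ratio t:
#   loss (1): w(t) = 1      (the formulation [1] appears to use)
#   loss (2): w(t) = 1 / t  (the NLL upper bound used by SEDD/RADD/MD4/LLaDA)
for t in (0.05, 0.25, 0.50, 0.90):
    print(f"t = {t:.2f}   w1 = 1.00   w2 = {1 / t:6.2f}")
# Loss (2) heavily up-weights lightly masked samples (small t); loss (1)
# weights all masking ratios equally, so its value is not comparable to an NLL.
```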

    b. Is validation loss a good metric for AR and DLM comparison?

    Short answer: when the loss formulation is problematic, certainly not, as the two losses do not represent the same thing; when the loss formulation is correct, still not, because:
    • One measures the exact negative log-likelihood, while the other is an upper bound.
    • A lower loss does not imply better capability, as evidenced in Section 1.2.
    Autoregressive models use a cross-entropy loss, which computes the exact negative log-likelihood via the chain rule:

    $$-\log p_\theta(x) = -\sum_{i=1}^{L} \log p_\theta\!\left(x^i \mid x^{<i}\right) \tag{3}$$

    However, masked DLMs compute an upper bound of the negative log-likelihood [5][6][16], as indicated in Equation 2 and below:

    $$-\log p_\theta(x_0) \le \mathcal{L}_2(x_0) \tag{4}$$

    Comparing an exact likelihood to an upper bound is inherently flawed, resulting in biased conclusions.
    Moreover, as discussed in Section 1.2, a lower validation loss does not necessarily imply improved downstream task performance. As depicted in Figure E, validation loss increases after a certain point due to overfitting, whereas benchmark performance continues to improve until the end. Consequently, many of the conclusions drawn in [1], which rely exclusively on validation loss, are fundamentally problematic.

    3.2 Problematic Experimental Settings for AR and Diffusion Comparison

    a. The AR benchmark results presented are far from the best

    Table 2 in [1] presents results from "models with the best validation loss". As indicated in Section 4.4 of [1], the diffusion model "observed no signs of convergence", implying that the reported results for diffusion models correspond to the final training checkpoint. In contrast, autoregressive (AR) models exhibit an increase in validation loss early in training, leading [1] to select an early checkpoint. As elaborated in Section 1.2, benchmark metrics continue to improve for AR models even after validation loss begins rising, suggesting that the reported AR results are suboptimal and consequently render the comparison significantly unfair. Figure G illustrates this discrepancy using training curves from our own experiments.
    Figure G: An illustration explaining why the AR benchmark results presented in [1] are far from the best. Note that the training curves shown here are taken from our results just for illustration, not from [1].

    b. The overfitting comparison is largely unfair for AR models

    Figure 4 of [1], “Training curves for different epoch counts, all using the same total compute”. Here the parameter count is 217M for AR models and 117M for diffusion models, hence the AR models consume fewer unique tokens (the x-axis represents total tokens). A larger model size and fewer unique tokens make a model more prone to overfitting by default, as discussed in Section 1.3 of this blog. View the source of the above numbers here.
    Figure 5 of [1], “we use the compute-optimal model and dataset size derived from single-epoch scaling laws and extend training across multiple epochs.” Here the AR models are larger than the diffusion models and hence consume fewer unique tokens under this setting. A larger model size and fewer unique tokens make a model more prone to overfitting by default, as discussed in Section 1.3 of this blog.
     
    [1] makes the following claims based on its Figures 4 and 5:
    AR models overfit with increased repetition, showing diverging loss curves. In contrast, diffusion models exhibit overlapping curves across repetitions, indicating no signs of overfitting and a very low decay rate with data reuse.
    and
    We find that for AR models, repeated data provides nearly the same benefit as fresh data only up to about 4 epochs. Beyond this point, additional repetition yields diminishing returns. In contrast, diffusion models continue to match the unique-data curve for up to 100 epochs, indicating a far greater capacity to benefit from repeated data in data-constrained regimes.
    However, the experimental setups used to substantiate these claims are fundamentally flawed. As illustrated in Figure E, increasing the model size or reducing the unique tokens in the training dataset significantly accelerates the onset of overfitting. Yet, Figures 4 and 5 in [1] employ substantially larger models and fewer unique tokens for AR compared to Diffusion, rendering the comparisons inherently unfair. Even with identical architectures, as elaborated in Section 1.3, a larger model size coupled with fewer unique training tokens inevitably leads to premature overfitting. Additional details are provided in the captions of the aforementioned figures.

    3.3 Problematic Scaling Law Fitting Methodology

    Following [11], [1] trained AR and masked DLMs with various sizes and data budgets to fit the loss functional form below:

    $$L(N, D) = E + \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} \tag{5}$$

    where $N$ and $D$ denote the model parameter count and dataset size, respectively; $E$, $A$, $B$, $\alpha$, $\beta$, $R_N^*$, $R_D^*$ are coefficients fitted to the training data points; and $N'$ and $D'$ are the “effective parameters” and “effective dataset size” that account for the diminishing returns of repetition.
     
    Figure 4 of [11]. This figure is used as an example of the actual validation loss shapes for autoregressive models, questioning the validity of the non-increasing scaling law formula used in [1] and [11].
     
    It is easy to verify that $L(N, D)$ is non-increasing w.r.t. $N$ and $D$, a condition that implicitly enforces a restrictive assumption on the validation loss shape. However, as illustrated in Figure 4 of [11] and Figure E, this assumption does not consistently hold (also noted but left unaddressed in Section F of [11]). Specifically, validation loss may increase as the model overfits the training data, rendering any conclusions drawn from the predictions of loss formulation (5) biased. For instance, the monotonic trends with respect to epochs shown in Figure 5 of [1] will not hold in practice, and the predictions shown in Figure 6 of [1] might similarly be overshooting.
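To see why, here is a small sketch of the functional form in Equation 5 using the data-constrained parameterization of [11] (the coefficient values below are made-up placeholders, and we set N' = N for brevity): the predicted loss can only decrease and flatten as epochs grow, so it cannot express the overfitting-induced rise seen in Figure E and in Figure 4 of [11].

```python
import math

def effective_data(unique_tokens: float, epochs: float, r_d_star: float) -> float:
    """Effective dataset size D' from [11]: repeated epochs add diminishing value."""
    repeats = epochs - 1.0
    return unique_tokens * (1.0 + r_d_star * (1.0 - math.exp(-repeats / r_d_star)))

def predicted_loss(n_params: float, unique_tokens: float, epochs: float,
                   E=1.8, A=500.0, B=2000.0, alpha=0.34, beta=0.28,
                   r_d_star=15.0) -> float:
    """Equation (5) with N' = N. Monotonically non-increasing in epochs."""
    d_eff = effective_data(unique_tokens, epochs, r_d_star)
    return E + A / n_params ** alpha + B / d_eff ** beta

for ep in (1, 4, 16, 64, 256, 1024):
    print(f"epochs={ep:5d}  predicted loss = {predicted_loss(1e9, 1.5e9, ep):.4f}")
# The predicted loss only decreases and plateaus; real validation curves
# (Figure E) eventually turn upward once the model overfits.
```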
    Additionally, [1] overlooks another critical aspect: although masked diffusion models exhibit robustness to data repetitions, prolonged training inevitably leads to overfitting and the rise in validation loss, as discussed earlier in Section 1.3.

    Closing Remarks

    This blog presents compelling evidence that DLMs are super data learners, achieving substantially higher data efficiency under a fixed unique data budget—positioning them as a highly promising architectural paradigm. It further explores the practical trade-offs between DLMs and AR models, offering grounded insights. Additionally, it critiques methodological shortcomings in contemporaneous studies, identifying key sources of error and aiming to provide more reliable conclusions that can inform and inspire future research.
    I strongly encourage the community to post more critical blogs as a more effective “open review”. Especially now, as traditional conference reviews increasingly lose credibility, robust and transparent community feedback is crucial for advancing science—not just AI—toward healthier and more rigorous standards.

    Citation

    If you find this blog useful, please consider citing (author list will be updated):

    References

    [1] Prabhudesai, Mihir, et al. "Diffusion Beats Autoregressive in Data-Constrained Settings." arXiv preprint arXiv:2507.15857 (2025, version 2).
    [2] Su, Dan, et al. "Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset." arXiv preprint arXiv:2412.02595 (2024).
    [3] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." Journal of machine learning research 21.140 (2020): 1-67.
    [4] Lou, Aaron, Chenlin Meng, and Stefano Ermon. "Discrete diffusion modeling by estimating the ratios of the data distribution." arXiv preprint arXiv:2310.16834 (2023).
    [5] Ou, Jingyang, et al. "Your absorbing discrete diffusion secretly models the conditional distributions of clean data." arXiv preprint arXiv:2406.03736 (2024).
    [6] Shi, Jiaxin, et al. "Simplified and generalized masked diffusion for discrete data." Advances in neural information processing systems 37 (2024): 103131-103167.
    [7] Sahoo, Subham, et al. "Simple and effective masked diffusion language models." Advances in Neural Information Processing Systems 37 (2024): 130136-130184.
    [8] Nie, Shen, et al. "Large language diffusion models." arXiv preprint arXiv:2502.09992 (2025).
    [9] Ye, Jiacheng, et al. “Dream 7B.” https://hkunlp.github.io/blog/2025/dream (2025)
    [10] Gong, Shansan, et al. "DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation." arXiv preprint arXiv:2506.20639 (2025).
    [11] Muennighoff, Niklas, et al. "Scaling data-constrained language models." Advances in Neural Information Processing Systems 36 (2023): 50358-50376.
    [12] Xue, Shuchen, et al. "Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture." arXiv preprint arXiv:2506.19935 (2025).
    [13] Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
    [14] OpenAI. "Introducing OpenAI o1” https://openai.com/o1/ (2025)
    [15] Arriola, Marianne, et al. "Block diffusion: Interpolating between autoregressive and diffusion language models." arXiv preprint arXiv:2503.09573 (2025).
    [16] Nie, Shen, et al. "Scaling up masked diffusion models on text." arXiv preprint arXiv:2410.18514 (2024).
    [17] Khanna, Samar, et al. "Mercury: Ultra-Fast Language Models Based on Diffusion." arXiv e-prints (2025): arXiv-2506.
    [19] ByteDance Seed. “Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference.” https://lf3-static.bytednsdoc.com/obj/eden-cn/hyvsmeh7uhobf/sdiff_updated.pdfj (2025)
    [20] Xue, Fuzhao, et al. "To repeat or not to repeat: Insights from scaling llm under token-crisis." Advances in Neural Information Processing Systems 36 (2023): 59304-59322.
     
