Balancing Understanding and Generation in Discrete Diffusion Models

1University of Chinese Academy of Sciences, UCAS
2Xiaohongshu Inc.
* Corresponding Author

Abstract

In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) a principled theoretical unification of MDLM and UDLM that recovers each paradigm as a special case; and (2) an alleviated memory bottleneck enabled by an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM's superior potential for long-term scaling.

Highlights

XDLM demonstrates strong potential in practical applications. By fine-tuning LLaDA, it achieved a high score of 15.0 on the MBPP code generation benchmark in just 32 generation steps. This represents nearly a twofold performance increase compared to the baseline. As shown in the figure on the right, this significant boost is primarily attributed to LLaDA-XDLM's ability to drastically reduce generation failures.

The study identifies two dominant paradigms in existing discrete diffusion models, each with distinct performance trade-offs.

MDLM (Masked Diffusion Language Model): this mask-based paradigm excels at comprehension tasks (such as zero-shot perplexity) and at generation with sufficient steps, but its performance declines sharply in few-step generation.

UDLM (Uniform-noise Diffusion Language Model): in contrast, this uniform-noise-based paradigm significantly outperforms MDLM in few-step generation scenarios but lags in overall comprehension ability.

Based on these observations, we propose XDLM. By utilizing a unified stationary noise kernel, XDLM integrates the strengths of both paradigms, aiming to achieve superior model comprehension and efficient few-step generation simultaneously. As illustrated below, the core idea of XDLM is the combination of the noise kernels of UDLM (u) and MDLM (m), striking an effective trade-off between the two. The left side of the figure demonstrates the noise-kernel mixing mechanism, where [NORMAL] denotes standard tokens and [MASK] represents masked tokens. The right side intuitively reveals the balance between comprehension (zero-shot perplexity) and generation efficiency (perplexity under 32-step sampling). Experiments show that XDLM reaches its optimal balance point at a mixing ratio of 0.1.
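As a concrete illustration, one way to mix the two paradigms is to interpolate their stationary distributions: the mask point-mass (MDLM) and the uniform distribution over the vocabulary (UDLM). The sketch below is our own minimal parameterization under that assumption, not the paper's exact formula; the function names and the marginal form `q(x_t|x_0) = alpha_bar * onehot(x_0) + (1 - alpha_bar) * pi` are illustrative.

```python
import numpy as np

def mixed_stationary(vocab_size, k, mask_id):
    """Hypothetical mixed stationary distribution:
    pi = (1 - k) * one_hot(mask) + k * uniform.
    k = 0 recovers a mask-only (MDLM-like) kernel,
    k = 1 recovers a uniform (UDLM-like) kernel.
    """
    pi = np.full(vocab_size, k / vocab_size)
    pi[mask_id] += 1.0 - k
    return pi

def corrupt_marginal(x0, alpha_bar, pi):
    """Forward marginal q(x_t | x_0) for an interpolating kernel:
    keep x_0 with weight alpha_bar, otherwise resample from pi."""
    m = (1.0 - alpha_bar) * pi.copy()
    m[x0] += alpha_bar
    return m
```

At the two extremes of k, corruption either always jumps to [MASK] or resamples uniformly, matching the intuition that XDLM spans the space between the two paradigms.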

To establish the theoretical foundation for the practical application of this unified noise kernel, we further derive its corresponding posterior probability (used for model inference) and KL divergence (used for model training), as well as its limiting form.

This critical simplification significantly conserves computational resources, enabling XDLM to achieve an exceptional balance between throughput and GPU memory usage. As shown in the table below, this advantage is particularly prominent during the sampling (inference) stage: XDLM's throughput is more than 2.4x that of UDLM, while its memory footprint is substantially reduced, demonstrating superior engineering efficiency.
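To see why such a simplification can save memory, consider the generic interpolating kernel Q_t = a_t * I + (1 - a_t) * 1 pi^T: both factors in the Bayes posterior are "one-hot plus constant" vectors, so the posterior can be computed in O(V) per token without ever materializing a V x V transition matrix. The sketch below is our own illustration under that kernel assumption; it is not the paper's derivation, and the function name and parameterization are hypothetical.

```python
import numpy as np

def posterior(x_t, x0_probs, alpha_t, alpha_bar_prev, pi):
    """Posterior q(x_{t-1} | x_t, x_0) for a kernel Q_t = a_t*I + (1-a_t)*1*pi^T.

    Bayes' rule: q(x_{t-1} | x_t, x_0) is proportional to
    Q_t[x_{t-1}, x_t] * q(x_{t-1} | x_0). Both factors reduce to a
    one-hot spike plus a constant/pi term, so the cost is O(V) per token.
    """
    V = pi.shape[0]
    # Likelihood over x_{t-1}: (1 - a_t) * pi[x_t] everywhere, plus a_t on the x_t entry.
    lik = (1.0 - alpha_t) * pi[x_t] * np.ones(V)
    lik[x_t] += alpha_t
    # Cumulative marginal q(x_{t-1} | x_0) = abar_{t-1} * x0 + (1 - abar_{t-1}) * pi.
    prior = alpha_bar_prev * x0_probs + (1.0 - alpha_bar_prev) * pi
    post = lik * prior
    return post / post.sum()
```

Avoiding the dense V x V matrix product is what makes large vocabularies tractable, consistent with the throughput and memory gains reported above.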

Benefiting from this mixed-noise mechanism, XDLM demonstrates robust error-correction and refinement capabilities across both text and image tasks. The figure below illustrates a segment of the sequence generation process at step T = 8/32, vividly depicting its three inherent transition dynamics:

Green: generating new content from [MASK] tokens (content creation).
Blue: refining existing tokens (quality enhancement).
Red: "re-masking" inappropriate tokens for regeneration in subsequent steps (error correction).
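The three transition types can be read off by comparing token sequences from consecutive sampling steps. The helper below is a hypothetical diagnostic of our own (not part of the paper's code) that labels each position accordingly:

```python
def classify_transitions(prev_tokens, new_tokens, mask_id):
    """Label per-position dynamics between two consecutive sampling steps.

    create  (green): a [MASK] token is replaced by content
    refine  (blue) : an existing token is rewritten
    remask  (red)  : a token is sent back to [MASK] for later regeneration
    keep           : the token is unchanged
    """
    labels = []
    for p, n in zip(prev_tokens, new_tokens):
        if p == n:
            labels.append("keep")
        elif p == mask_id:
            labels.append("create")
        elif n == mask_id:
            labels.append("remask")
        else:
            labels.append("refine")
    return labels
```

For example, with mask id 5, the step `[5, 3, 7, 2] -> [4, 3, 5, 9]` contains one creation, one kept token, one re-masking, and one refinement.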

This robust corrective capability extends seamlessly to image generation tasks, where the model iteratively optimizes and rectifies.

We further conduct a comprehensive evaluation of XDLM, spanning text generation on the OWT and LM1B datasets, image generation on ImageNet-1K and CIFAR-10 (including scenarios with Classifier-Free Guidance), and common zero-shot generalization benchmarks. The figure below illustrates text generation performance as a function of sampling steps on the LM1B and OWT datasets.

A significant finding is that XDLM and UDLM, both of which incorporate uniform noise components, exhibit continuous improvements in generation performance as training scale increases, ultimately surpassing the mask-only MDLM. This phenomenon has been consistently validated across both text and image generation tasks, providing robust support for the future scalability of diffusion models that leverage uniform noise.

Conclusion

In this paper, we presented XDLM, a unified approach that theoretically bridges the gap between Masked and Uniform-noise diffusion. By redefining the forward process as a weighted row-stochastic transition, we proved that XDLM recovers existing paradigms (MDLM and UDLM) as special cases. To ensure practicality, we derived a memory-efficient implementation that reduces computational complexity, enabling training on large vocabularies. Empirically, XDLM breaks the Pareto frontier between understanding and generation. We identified a mixing ratio of k = 0.1 as the optimal "sweet spot", where the model combines the robust zero-shot likelihoods of masking models with the superior sample diversity and few-step generation quality of uniform-noise models. These advantages extend across domains, achieving state-of-the-art results on ImageNet-1K and demonstrating significant scalability in the continual pretraining of 8B-parameter LLMs, where XDLM doubled performance on code generation benchmarks.

Future Work

First, we have not yet trained XDLM from scratch at a large scale; such pre-training would likely allow for a more comprehensive exploration of the model’s emergent properties.

Second, we did not fully investigate the “performance crossover” phenomenon, wherein UDLM and XDLM appear to outperform MDLM in generation tasks involving large sampling steps (approaching autoregressive decoding).

Third, domain-specific sampling strategies for XDLM in language modeling and image generation have not yet been optimized.

Furthermore, while we confirmed that XDLM balances understanding and generation, the interaction and balance between textual and visual modalities within a single unified model remain uninvestigated.

Finally, the development of post-training schemas and inference acceleration techniques for XDLM remains a subject for future work.

Citation

If you find our work useful or interesting, please consider citing our paper:

@article{liu2026balancing,
title={Balancing Understanding and Generation in Discrete Diffusion Models},
author={Liu, Yue and Zhao, Yuzhong and Xie, Zheyong and Ye, Qixiang and Jiao, Jianbin and Hu, Yao and Cao, Shaosheng and Liu, Yunfan},
journal={arXiv preprint arXiv:2602.01362},
year={2026}
}