Balancing Understanding and Generation in Discrete Diffusion Models

1University of Chinese Academy of Sciences, UCAS
2Xiaohongshu Inc.
* Corresponding Author
GitHub arXiv 🤗 HuggingFace

Abstract

In discrete generative modeling, two dominant paradigms exhibit divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) a principled theoretical unification of MDLM and UDLM that recovers each paradigm as a special case; and (2) an alleviated memory bottleneck, enabled by an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM reaches a score of 15.0 on MBPP in just 32 steps, effectively doubling the baseline performance. Finally, an analysis of training dynamics reveals XDLM’s superior potential for long-term scaling.
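To make the bridge concrete, here is a minimal, illustrative sketch (not the paper’s exact definition) of one way such a stationary noise kernel can be written: a convex combination of the standard absorbing (mask) kernel and the uniform kernel. The function name, the mixing weight k, and the convention that k = 0 gives masking while k = 1 gives uniform noise are assumptions made for this sketch only.

import torch

def mixed_kernel(vocab_size: int, mask_id: int, alpha: float, k: float) -> torch.Tensor:
    """Illustrative one-step row-stochastic transition matrix Q (assumed form).

    alpha : probability of keeping the current token at this step.
    k     : assumed mixing weight on the uniform component
            (k = 0 -> absorbing / MDLM-style, k = 1 -> uniform / UDLM-style).
    """
    V = vocab_size
    eye = torch.eye(V)

    # Absorbing kernel: with probability (1 - alpha), jump to the [MASK] token.
    q_absorb = alpha * eye
    q_absorb[:, mask_id] += 1.0 - alpha

    # Uniform kernel: with probability (1 - alpha), resample uniformly over the vocabulary.
    q_uniform = alpha * eye + (1.0 - alpha) / V * torch.ones(V, V)

    # A convex combination of row-stochastic matrices is itself row-stochastic.
    return (1.0 - k) * q_absorb + k * q_uniform

Under this assumed convention, k = 0 collapses every row to the absorbing kernel of masked diffusion and k = 1 to the uniform kernel, consistent with the claim that MDLM and UDLM are recovered as special cases; intermediate values interpolate between the two.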

Conclusion

In this paper, we presented XDLM, a unified approach that theoretically bridges the gap between masked and uniform-noise diffusion. By redefining the forward process as a weighted row-stochastic transition, we proved that XDLM recovers the existing paradigms (MDLM and UDLM) as special cases. To ensure practicality, we derived a memory-efficient implementation that reduces computational complexity and enables training on large vocabularies. Empirically, XDLM advances the Pareto frontier between understanding and generation. We identified a mixing ratio of k = 0.1 as the optimal “sweet spot”, at which the model combines the robust zero-shot likelihoods of masking models with the superior sample diversity and few-step generation quality of uniform-noise models. These advantages extend across domains: XDLM achieves state-of-the-art results on ImageNet-1K and scales to continual pretraining of 8B-parameter LLMs, where it doubles performance on code generation benchmarks.
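The exact algebraic simplification of the posterior is not reproduced in this excerpt. As a rough illustration of the memory argument only, the sketch below (hypothetical names, same assumed mixing convention as above) samples a forward noising step per token, so nothing of size V × V is ever materialized.

import torch

def forward_noise_step(x0: torch.Tensor, vocab_size: int, mask_id: int,
                       alpha: float, k: float) -> torch.Tensor:
    """Sample one step of the assumed mixed kernel without building a V x V matrix.

    Per token: keep with probability alpha, jump to [MASK] with probability
    (1 - k) * (1 - alpha), otherwise resample uniformly over the vocabulary.
    Memory cost is O(batch * length) rather than O(V^2).
    """
    u = torch.rand(x0.shape)
    keep_p = alpha
    mask_p = (1.0 - k) * (1.0 - alpha)

    xt = x0.clone()
    to_mask = (u >= keep_p) & (u < keep_p + mask_p)
    to_unif = u >= keep_p + mask_p
    xt[to_mask] = mask_id
    xt[to_unif] = torch.randint(vocab_size, (int(to_unif.sum()),))
    return xt

For simplicity, the uniform branch here resamples over the full vocabulary, including the mask token; whether XDLM excludes the mask token from its uniform component is a detail not specified in this excerpt.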

Future Work

First, we have not yet trained XDLM from scratch at a large scale; such pre-training would likely allow for a more comprehensive exploration of the model’s emergent properties.

Second, we did not fully investigate the “performance crossover” phenomenon, wherein UDLM and XDLM appear to outperform MDLM in generation tasks when the number of sampling steps grows large (approaching autoregressive decoding).

Third, domain-specific sampling strategies for XDLM in language modeling and image generation have not yet been optimized.

Furthermore, while we confirmed that XDLM balances understanding and generation, the interaction and balance between textual and visual modalities within a single unified model remain uninvestigated.

Finally, the development of post-training schemas and inference acceleration techniques for XDLM remains a subject for future work.

Citation

If you find our work useful or interesting, please consider citing our paper:

@article{liu2026balancing,
  title={Balancing Understanding and Generation in Discrete Diffusion Models},
  author={Liu, Yue and Zhao, Yuzhong and Xie, Zheyong and Ye, Qixiang and Jiao, Jianbin and Hu, Yao and Cao, Shaosheng and Liu, Yunfan},
  journal={arXiv preprint arXiv:2602.01362},
  year={2026}
}