Ant Group joins hands with Renmin University to release MoE diffusion model

```

There is now another path toward AGI.

On September 11, at the 2025 Bund Conference, Ant Group and Renmin University of China jointly released the industry's first diffusion language model with a native MoE architecture, "LLaDA-MoE".

According to the announcement, the two parties trained a diffusion language model with an MoE architecture from scratch on approximately 20T of data, verifying the scalability and stability of large-scale industrial training. Its performance surpasses the previously released dense diffusion language models LLaDA1.0/1.5 and Dream-7B and is on par with comparable autoregressive models, while retaining an inference speed advantage of several times. The model will be fully open-sourced soon.

Using a non-autoregressive masked diffusion mechanism, the new model is the first natively trained MoE diffusion language model to reach language intelligence on par with Qwen2.5, including in-context learning, instruction following, and code and mathematical reasoning. This challenges the mainstream belief that "language models must be autoregressive."

Performance data shows that LLaDA-MoE outperforms the diffusion language models LLaDA1.0/1.5 and Dream-7B on code, mathematics, and agent tasks, and is close to or surpasses the autoregressive model Qwen2.5-3B-Instruct. It achieves the performance of an equivalent 3B dense model while activating only 1.4B parameters.

"The LLaDA-MoE model has verified the scalability and stability of large-scale industrial training, which means we have taken another step forward in scaling dLLMs to even larger sizes," said Lan Zhenzhong at the release event.

Li Chongxuan, Associate Professor at the Gaoling School of Artificial Intelligence at Renmin University of China, said, "Two years have passed, and the capabilities of large AI models have made great progress, but some problems have never been fundamentally solved. The reason is that the currently prevailing autoregressive generation paradigm inherently models text in one direction, generating the next token sequentially from front to back. This makes it difficult for such models to capture the bidirectional dependencies between tokens."

In the face of these issues, some researchers have taken a different path, turning their attention to diffusion language models (dLLMs) with parallel decoding. However, existing dLLMs are all based on dense architectures, making it difficult to replicate the "parameter scaling and computational efficiency" advantages that MoE brings to autoregressive models (ARMs). Against this background, the joint research team from Ant Group and Renmin University launched LLaDA-MoE, the first native diffusion language model based on the MoE architecture.

Lan Zhenzhong also stated, "We will soon fully open source the model weights and our self-developed inference framework to the world, collaborating with the community to drive a new round of breakthroughs in AGI."

According to the team, Ant Group and Renmin University spent three months on the project, rewriting the training code on the basis of LLaDA-1.0. They relied on a series of parallel acceleration techniques, such as EP parallelism, provided by ATorch, Ant's self-developed distributed framework, and used training data based on Ant's Ling2.0 base model to overcome core challenges such as load balancing and noise sampling drift. Ultimately, they efficiently trained a 7B-A1B MoE model on approximately 20T of data.

Under Ant's self-developed unified evaluation framework, LLaDA-MoE achieved an average improvement of 8.4% across 17 benchmarks, including HumanEval, MBPP, GSM8K, MATH, IFEval, and BFCL. It leads LLaDA-1.5 by up to 13.2% and is on par with Qwen2.5-3B-Instruct. The experiments once again confirmed that the "MoE amplifier" law also holds for dLLMs, providing a viable path toward subsequent 10B–100B sparse models.

According to Lan Zhenzhong, in addition to the model weights, Ant Group will simultaneously open-source an inference engine deeply optimized for the parallel decoding of dLLMs. Compared with NVIDIA's official fast-dLLM, this engine achieves significant acceleration. The related code and technical report will be released soon on GitHub and Hugging Face.

Lan Zhenzhong also revealed that Ant Group will continue investing in areas including dLLM-based AGI, and in the next stage will work with academia and the global AI community to jointly drive new AGI breakthroughs. "Autoregression is not the end point; diffusion models can also become a main thoroughfare toward AGI," Lan said.

```
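
For readers unfamiliar with the idea, the parallel decoding that the article attributes to masked diffusion language models can be illustrated with a toy sketch. This is not the LLaDA-MoE implementation: `toy_predict` is a hypothetical stand-in for a model forward pass that simply looks up a known target sentence, and `tokens_per_step` and the confidence rule are assumptions chosen for illustration. The sketch only shows the decoding pattern, in which every position starts masked and several positions are filled in per step, so the number of steps can be smaller than the number of tokens.

```python
MASK = "<mask>"


def toy_predict(tokens, target):
    """Hypothetical stand-in for a dLLM forward pass (assumption, not a real API).

    Returns {position: (token, confidence)} for every masked position.
    The "prediction" is read from `target`; confidence grows with the
    number of already revealed neighbours (bidirectional context).
    """
    preds = {}
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        revealed = sum(tokens[j] != MASK for j in neighbours)
        preds[i] = (target[i], 0.5 + 0.25 * revealed)
    return preds


def diffusion_decode(target, tokens_per_step=2):
    """Fill a fully masked sequence, several positions per step."""
    tokens = [MASK] * len(target)
    steps = 0
    while MASK in tokens:
        preds = toy_predict(tokens, target)
        # Commit the most confident predictions; break ties by position.
        best = sorted(preds, key=lambda i: (-preds[i][1], i))[:tokens_per_step]
        for i in best:
            tokens[i] = preds[i][0]
        steps += 1
    return tokens, steps


if __name__ == "__main__":
    sentence = "diffusion models decode several tokens per step".split()
    decoded, steps = diffusion_decode(sentence, tokens_per_step=2)
    print(" ".join(decoded))
    print(f"{steps} steps for {len(sentence)} tokens")
```

With `tokens_per_step=1` the loop degenerates to one token per step, which matches the step count of front-to-back autoregressive generation. Real systems decide how many tokens to commit per step from model confidence, and the actual speedup depends on the model and inference engine.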