Google Launches DiffusionGemma: Up to Four Times Faster Text Generation, Aimed at Local Real-Time Applications

```

Google has released DiffusionGemma, an experimental open-source model that uses text diffusion to generate text up to four times faster on dedicated GPUs, giving developers a new option for low-latency local workflows.

The model builds on the Gemma 4 architecture and Google’s Gemini Diffusion research, and is released under the open-source Apache 2.0 license.

Unlike traditional autoregressive large language models, which generate tokens one at a time, DiffusionGemma can generate 256 tokens in parallel in each forward pass. Reported output speeds exceed 1,000 tokens per second on a single NVIDIA H100 and 700 tokens per second on an NVIDIA GeForce RTX 5090.

Google also notes that DiffusionGemma is still experimental and that its overall output quality falls short of the standard Gemma 4 model. For production applications that demand the highest output quality, Google recommends continuing to deploy standard Gemma 4.

Architectural Innovation: From "Typewriter" to "Printing Press"

The core technological breakthrough of DiffusionGemma lies in changing the way language models utilize hardware.

Traditional language models work like typewriters, generating text token by token from left to right. This works well on cloud servers, which can batch thousands of user requests and share the compute across them. When the model runs locally for a single user, however, each step has to read the model’s weights from memory to produce just one token, so the GPU’s compute units sit idle most of the time.

DiffusionGemma uses text diffusion to shift this bottleneck from memory bandwidth to computation.

The model first fills a "canvas" with random placeholder tokens, then refines them over multiple iterative rounds. In each round, it locks in the tokens it has confirmed and uses them as context to revise the remaining positions, until the canvas converges on a complete passage. Google compares this process to "upgrading a single typewriter to a large printing press capable of printing an entire page simultaneously."

This speed advantage applies only within clear limits. Google notes that in high-throughput cloud serving, autoregressive models already make full use of the hardware through batching, so DiffusionGemma’s parallel decoding advantage shrinks and the model may even raise serving costs. Its throughput advantage shows up mainly at low to medium batch sizes on a single accelerator.

Low Hardware Deployment Threshold, Supports Bidirectional Attention and Self-Correction

DiffusionGemma is a mixture-of-experts (MoE) model with 26B parameters, of which only 3.8B are activated during inference. After quantization, the model can run on a high-end consumer GPU with 18GB of VRAM, lowering the hardware barrier for local deployment.

The model uses bidirectional attention, which lets each token attend to all other tokens in the passage during generation. Google says this offers significant advantages in nonlinear generation tasks such as inline editing, code completion, amino acid sequence generation, and mathematical graph construction.

The model can also self-correct, evaluating and revising entire paragraphs in real time as it generates.

Third-party AI tooling company Unsloth has fine-tuned DiffusionGemma to solve sudoku puzzles. Such tasks require forward reasoning and are challenging for traditional autoregressive models, whereas DiffusionGemma’s bidirectional attention handles them more naturally.

Positioning and Limitations: Experimental Exploration Rather Than Production Replacement

Google clearly positions DiffusionGemma for researchers and developers rather than as a direct replacement for existing production models. Its target use cases focus on speed-sensitive local interactive workflows, such as real-time text editing, rapid content iteration, and nonlinear text structure generation.

Despite the clear speed advantage, Google acknowledges that DiffusionGemma’s output quality still trails standard Gemma 4, and benchmark results show clear performance trade-offs. For commercial applications that require high-accuracy output, the model is not yet ready to replace existing mainstream models.

Text diffusion technology itself is not a new concept and has been explored by the AI research community for many years, but applying it to large-scale models has long faced challenges.

The release of DiffusionGemma marks a measurable step by Google toward putting this research direction into practice. Whether later versions can strike a better balance between quality and speed will remain a focus for the market.

```
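
The iterative refinement described under "Architectural Innovation" can be illustrated with a toy loop. This is only a sketch under loud assumptions, not DiffusionGemma's actual algorithm: `toy_predict` is a hypothetical stand-in for a forward pass (it simply proposes the known target token, with a made-up confidence that rises when neighbouring positions are already committed), a single mask symbol replaces the random placeholder tokens, and `per_round` is an invented commit schedule.

```python
import random

MASK = "_"  # placeholder for a canvas position that is not committed yet


def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for one forward pass over the whole canvas.

    Returns a (token, confidence) proposal for every position. Confidence
    rises with the number of committed neighbours, mimicking how locked-in
    tokens serve as context for the remaining positions.
    """
    proposals = []
    for i, tok in enumerate(canvas):
        if tok != MASK:
            proposals.append((tok, 1.0))  # already committed
            continue
        neighbours = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        context = sum(n != MASK for n in neighbours)
        confidence = 0.3 + 0.3 * context + 0.1 * rng.random()
        proposals.append((target[i], confidence))
    return proposals


def diffusion_decode(target, per_round=4, seed=0):
    """Fill the canvas in rounds, committing the most confident proposals."""
    if per_round < 1:
        raise ValueError("per_round must be at least 1")
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    rounds = 0
    while MASK in canvas:
        proposals = toy_predict(canvas, target, rng)  # one pass, all positions
        open_positions = [i for i, tok in enumerate(canvas) if tok == MASK]
        # Lock in the most confident positions; the rest are re-predicted
        # in the next round with the newly committed tokens as context.
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:per_round]:
            canvas[i] = proposals[i][0]
        rounds += 1
    return canvas, rounds


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    tokens, rounds = diffusion_decode(sentence, per_round=3)
    print(" ".join(tokens), "| passes:", rounds, "vs", len(sentence), "autoregressive")
```

The point of the sketch is the pass count: committing several positions per pass finishes a 9-token canvas in 3 passes, where a one-token-per-pass decoder would need 9.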
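
The hardware figures under "Low Hardware Deployment Threshold" can be sanity-checked with some back-of-the-envelope arithmetic. The article does not state the quantization bit width, so the widths below are assumptions for illustration, and the estimate covers weights only (no activations, caches, or runtime overhead). Note that in a mixture-of-experts model all 26B parameters must be stored even though only 3.8B are active per token; the active count affects compute per token, not weight storage.

```python
TOTAL_PARAMS = 26e9    # total MoE parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference (from the article)
VRAM_GB = 18           # VRAM figure quoted in the article


def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits_in_vram(num_params, bits_per_param, vram_gb):
    """True if the weights alone fit; ignores activations and other overhead."""
    return weight_memory_gb(num_params, bits_per_param) <= vram_gb


if __name__ == "__main__":
    # Bit widths are assumptions; the article only says "after quantization".
    for bits in (16, 8, 4):
        size = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if fits_in_vram(TOTAL_PARAMS, bits, VRAM_GB) else "does not fit"
        print(f"{bits:>2}-bit weights: {size:.1f} GB, {verdict} in {VRAM_GB} GB")
    print(f"active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions, only the 4-bit case (about 13 GB of weights) fits within 18GB, which is consistent with the article's claim that quantization is what makes consumer-GPU deployment possible.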