Google has released DiffusionGemma, an experimental open-source model that departs from the sequential token-by-token approach used by conventional language models and instead generates complete blocks of text in parallel. The company describes the model as capable of producing text up to four times faster than typical language models when run on purpose-built GPUs.
Architecturally, DiffusionGemma is a 26-billion-parameter Mixture of Experts (MoE) model. During inference the system activates only 3.8 billion parameters, and when quantized it fits within the 18GB VRAM envelope typical of high-end consumer GPUs. That operational footprint is central to the model's appeal for local and interactive use cases.
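The footprint claim can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes that weights dominate memory use and tries a few common bit widths; the article does not state which quantization format produces the 18GB figure, so the bit widths and the helper name `weight_gb` are illustrative assumptions.

```python
# Back-of-envelope weight-memory estimate for a 26B-parameter model.
# Assumption: the bit widths below are illustrative; the article does not
# say which quantization format yields the 18GB figure.

TOTAL_PARAMS = 26e9      # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9    # parameters activated during inference (from the article)
VRAM_BUDGET_GB = 18      # quantized footprint cited in the article


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    size = weight_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size <= VRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active share of parameters: {active_fraction:.1%}")
```

Under these assumptions only the 4-bit case, at roughly 13 GB of weights, falls under 18 GB, which would leave some headroom for activations and runtime overhead. In an MoE model the full set of weights typically still has to be resident in memory even though only about 15% of the parameters are active at a time.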
Performance figures published by Google show DiffusionGemma exceeding 1,000 tokens per second on a single NVIDIA H100 GPU and topping 700 tokens per second on NVIDIA GeForce RTX 5090 hardware. The model produces 256 tokens in parallel in each forward pass and uses bi-directional attention, in which every token can attend to all others. DiffusionGemma also iteratively refines its own outputs, correcting its generations as it runs.
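To make the contrast with token-by-token decoding concrete, here is a toy sketch of the two ideas in this paragraph: a full (bi-directional) attention mask over a block, and a loop that predicts every position at once and commits the most confident ones over several passes. It is not DiffusionGemma's actual algorithm. `toy_denoiser`, the random confidence scores, and the commit schedule are hypothetical stand-ins, and the block is shrunk from 256 to 8 tokens for readability.

```python
import random

BLOCK = 8                  # the article cites 256 tokens; 8 keeps the demo small
MASK = "_"                 # placeholder for a position not yet committed
TARGET = list("parallel")  # hypothetical output the toy model converges to


def causal_mask(n):
    """Autoregressive: position i may attend only to positions <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Bi-directional: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def toy_denoiser(block, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (token, confidence) guess for every position in a single
    call, which is the 'parallel' part. A real model would compute these
    from attention over the whole block.
    """
    return [(TARGET[i], rng.random()) for i in range(len(block))]


def generate_block(steps=4, seed=0):
    rng = random.Random(seed)
    block = [MASK] * BLOCK
    per_step = BLOCK // steps
    for _ in range(steps):
        guesses = toy_denoiser(block, rng)
        # Commit the most confident still-masked positions; the rest stay
        # masked and are re-predicted on the next pass.
        open_pos = [i for i, tok in enumerate(block) if tok == MASK]
        open_pos.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in open_pos[:per_step]:
            block[i] = guesses[i][0]
        print("".join(block))
    return "".join(block)


print(sum(map(sum, causal_mask(BLOCK))), "allowed attention pairs with a causal mask")
print(sum(map(sum, full_mask(BLOCK))), "allowed attention pairs with a full mask")
result = generate_block()
```

The toy only fills in masked positions progressively and never revisits a committed token, so it illustrates parallel, multi-pass generation rather than the self-correction behavior Google describes.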
Google acknowledges a trade-off: while DiffusionGemma emphasizes generation speed and parallelism, its overall output quality is lower than that of standard Gemma 4 models. The company has positioned the model for researchers and developers focused on speed-critical and interactive local workflows, citing scenarios such as in-line editing, rapid iteration, and the production of non-linear text structures.
DiffusionGemma has been released under an Apache 2.0 license on Hugging Face. Google says the model is compatible with a range of tooling and runtimes, including MLX, vLLM with Red Hat integration, Hugging Face Transformers, Unsloth, and NVIDIA NeMo.
On the hardware front, Google worked with NVIDIA to tune performance across multiple layers of the stack. Optimizations cover consumer-oriented GPUs such as GeForce RTX 5090 and 4090, as well as enterprise-grade Hopper and Blackwell systems running NVFP4 kernels.
For practitioners and organizations evaluating the trade-offs between speed and generation quality, DiffusionGemma presents a distinct option: substantially higher throughput through parallel diffusion-based generation in exchange for lower fidelity compared with Gemma 4.
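One rough way to weigh that trade-off is to convert the published throughput figures into wall-clock time for a given output length. The sketch below assumes steady-state generation at exactly the quoted rates and ignores prompt processing and warm-up; the 2,000-token response length and the helper names are illustrative assumptions, not figures from Google.

```python
import math

BLOCK_TOKENS = 256  # block size generated in parallel (from the article)

# Published throughput figures in tokens per second (from the article).
THROUGHPUT = {
    "NVIDIA H100": 1000,
    "GeForce RTX 5090": 700,
}


def blocks_needed(num_tokens: int) -> int:
    """Number of 256-token blocks required to cover num_tokens."""
    return math.ceil(num_tokens / BLOCK_TOKENS)


def seconds_for(num_tokens: int, tokens_per_second: float) -> float:
    """Idealized generation time, ignoring prompt processing and warm-up."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 2000  # hypothetical response length chosen for illustration

for gpu, tps in THROUGHPUT.items():
    t = seconds_for(RESPONSE_TOKENS, tps)
    print(f"{gpu}: about {t:.2f} s for {RESPONSE_TOKENS} tokens "
          f"({blocks_needed(RESPONSE_TOKENS)} blocks)")
```

Because the article reports the rates as being exceeded, the resulting times are best read as conservative estimates for that hardware rather than measurements.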