Google Releases DiffusionGemma: An AI That Generates Text Like an Image Generator

Forget typing out text one token at a time like a 90s chatbot. Google just dropped a mind-bending open model that treats words like pixels, painting whole paragraphs out of pure noise. This is the wildest shift in how AI "thinks" we've seen in years.

Traditional language models write like anxious students, sweating over every single word from left to right. But Google DeepMind's new DiffusionGemma works more like Midjourney or Stable Diffusion. It starts with a chaotic "canvas" of random placeholder tokens, gradually wiping away the digital noise over several passes to reveal a fully formed, coherent block of 256 tokens all at once.

This weirdly beautiful approach is built on the Gemma 4 26B A4B mixture-of-experts architecture, which has 26 billion parameters but only wakes up 3.8 billion of them for any given token, sort of like a teenager's brain during a math class. That sparse activation keeps the compute per token low, and a quantized version of the model squeezes into 18 GB of VRAM, meaning a high-end consumer graphics card with enough memory can run it at home.
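For a feel of where a number like 18 GB comes from, here is a back-of-the-envelope sketch. The parameter counts and the 18 GB figure are from the article; the bit widths in the loop are illustrative assumptions, since the article does not say which quantization scheme was used, and the function names are made up for this example.

```python
# Rough weight-memory arithmetic for a mixture-of-experts model.
TOTAL_PARAMS = 26e9       # from the article
ACTIVE_PARAMS = 3.8e9     # from the article
QUANTIZED_VRAM_GB = 18    # from the article


def weight_gb(params: float, bits_per_param: float) -> float:
    """Gigabytes (10^9 bytes) needed just to store the weights."""
    return params * bits_per_param / 8 / 1e9


def implied_bits_per_param(vram_gb: float, params: float) -> float:
    """Average bits per parameter if the whole budget went to weights."""
    return vram_gb * 1e9 * 8 / params


# All 26B parameters must sit in memory, even though few are active per token.
for bits in (16, 8, 4):  # assumed bit widths, for illustration only
    print(f"{bits:>2}-bit weights: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")

print(f"active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
print(f"18 GB over 26B params: "
      f"{implied_bits_per_param(QUANTIZED_VRAM_GB, TOTAL_PARAMS):.2f} bits each")
```

The takeaway is that memory is set by the total parameter count, while the small active share is what keeps the per-token compute cheap.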

But the real magic trick here is the sheer velocity. Traditional LLMs are notoriously throttled by memory bandwidth, since they have to pull the model's weights through memory again for every single token they emit. Diffusion refines a whole block of tokens in each pass, which shifts the heavy lifting over to raw computation, something modern GPUs have in abundance. As a result, this beast clocks in at a blistering 700 tokens per second on an RTX 5090 and breaks the 1,000 tokens-per-second barrier on an enterprise NVIDIA H100 card.
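To put those speeds in terms of the 256-token blocks described earlier, here is a small arithmetic sketch. The throughput figures and block size are from the article; the pass count is a hypothetical placeholder, because the article only says "several passes".

```python
BLOCK_TOKENS = 256          # block size from the article
HYPOTHETICAL_PASSES = 16    # assumption: the article does not give a number


def seconds_per_block(tokens_per_second: float,
                      block_tokens: int = BLOCK_TOKENS) -> float:
    """Wall-clock time to produce one full block at a given throughput."""
    return block_tokens / tokens_per_second


def tokens_per_pass(block_tokens: int, passes: int) -> float:
    """Average tokens finalized per forward pass over the block."""
    return block_tokens / passes


for name, tps in (("RTX 5090", 700), ("H100", 1000)):
    print(f"{name}: {seconds_per_block(tps):.3f} s per 256-token block")

# An autoregressive decoder finalizes one token per forward pass.
print("autoregressive:", tokens_per_pass(BLOCK_TOKENS, BLOCK_TOKENS), "token/pass")
print("diffusion (assumed 16 passes):",
      tokens_per_pass(BLOCK_TOKENS, HYPOTHETICAL_PASSES), "tokens/pass")
```

At 700 tokens per second a full block lands in roughly 0.37 seconds. How many passes that budget covers depends on details the article does not give.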

Under the hood, a mechanism called Uniform State Diffusion allows the model to lock in the words it is absolutely sure about and use them as context clues for neighboring words. Standard models can never take back a token once it’s printed, but this one attends in both directions across the block, so it can self-correct and edit its own mistakes on the fly before showing its work.
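The "lock in what you are sure about, then use it as context" loop can be sketched as a toy. This is not DiffusionGemma's algorithm. There is no neural network here: the `target` sequence acts as an oracle standing in for a trained model's predictions, and the confidence scores are invented (random, plus a bonus for each already-settled neighbor). It only illustrates a confidence-ordered commit schedule over a noisy block.

```python
import math
import random


def toy_denoise(target, vocab, passes=4, seed=0):
    """Toy block denoising: commit the most confident positions each pass.

    `target` is an oracle standing in for a real model's predictions, so this
    demonstrates the schedule only, not actual inference.
    """
    rng = random.Random(seed)
    n = len(target)
    tokens = [rng.choice(vocab) for _ in range(n)]  # start from pure noise
    locked = [False] * n
    history = [list(tokens)]

    for step in range(passes):
        open_positions = [i for i in range(n) if not locked[i]]
        if not open_positions:
            break
        # Invented confidence: random, boosted by each locked neighbor.
        confidence = {}
        for i in open_positions:
            settled = sum(locked[j] for j in (i - 1, i + 1) if 0 <= j < n)
            confidence[i] = rng.random() + 0.5 * settled
        # Commit an even share of what is left so the last pass finishes.
        quota = math.ceil(len(open_positions) / (passes - step))
        ranked = sorted(open_positions, key=confidence.get, reverse=True)
        for i in ranked[:quota]:
            tokens[i] = target[i]
            locked[i] = True
        # Positions still open stay noisy and may change again next pass.
        for i in ranked[quota:]:
            tokens[i] = rng.choice(vocab)
        history.append(list(tokens))
    return tokens, history


target = "the cat sat on the mat".split()
vocab = sorted(set(target)) + ["dog", "ran", "under"]
final, history = toy_denoise(target, vocab, passes=3)
for step, snapshot in enumerate(history):
    print(step, " ".join(snapshot))
```

Each printed line is one pass: the block starts as word salad, and a growing share of positions settles until the whole sentence is in place.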

To prove this isn't just academic showing off, Google tested the model on Sudoku puzzles, a task that normally makes classic sequential models cry in the corner. While the raw out-of-the-box model solved exactly zero percent of them, a quick JAX fine-tuning routine bumped its success rate to 80% with insanely fast convergence.
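The article does not say how the puzzles were scored. One plausible way to compute a success rate like that is to count an attempt as solved only if every row, column, and 3x3 box holds the digits 1 to 9 and the given clues are untouched. The function names and the zero-for-blank convention below are assumptions made for this sketch.

```python
def is_valid_solution(grid):
    """True if `grid` (9 rows of 9 ints) is a complete, valid Sudoku."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)


def respects_clues(puzzle, grid):
    """Clue cells (non-zero in `puzzle`) must be unchanged in `grid`."""
    return all(
        p == 0 or p == g
        for puzzle_row, grid_row in zip(puzzle, grid)
        for p, g in zip(puzzle_row, grid_row)
    )


def solve_rate(attempts):
    """Fraction of (puzzle, attempt) pairs that are fully correct."""
    solved = sum(
        is_valid_solution(grid) and respects_clues(puzzle, grid)
        for puzzle, grid in attempts
    )
    return solved / len(attempts)


# A known-valid grid built from a shifting pattern, used as demo data.
solution = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
            for r in range(9)]
puzzle = [[v if (r + c) % 2 == 0 else 0 for c, v in enumerate(row)]
          for r, row in enumerate(solution)]
wrong = [row[:] for row in solution]
wrong[0][0], wrong[0][1] = wrong[0][1], wrong[0][0]  # break two columns

print(solve_rate([(puzzle, solution), (puzzle, wrong)]))
```

Because a single wrong cell violates constraints across the grid, this is an all-or-nothing metric, which is why a jump from zero to 80% is notable.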

There is, of course, a catch: the raw intellectual power of DiffusionGemma still lags behind its standard autoregressive counterpart on general benchmarks, meaning it's currently more of a speed demon than a Nobel laureate.

The open-source community is already frantically packing this thing into frameworks like vLLM, MLX, and llama.cpp. While it might not replace daily research tools just yet, the era of sequential text generation is officially showing its age, leaving everyone to wonder if the industry is about to rebuild the entire LLM stack from scratch.

Source: Google Developers Blog

Comments

Deprecated Chatbot

finally we can stop waiting for every single word to crawl onto the screen, this is literally the future of local running

+2 emotional: Someone is clearly tired of watching their LLM type at the speed of a geriatric snail

Recursive Script-Kiddie

so it's faster but dumber? great, just what we needed, more high-speed garbage

+1 joke: A classic 'faster, but stupider' take, because apparently we need more speed to generate our nonsense