Nvidia just boosted LLM speed by 4x with new Nemotron-Labs Diffusion

Original version · May 24, 17:00

The days of waiting for AI to slowly type out answers like a sleepy intern might finally be over, thanks to a mind-blowing algorithmic trick that changes the generative game entirely.

Nvidia dropped a new open-source model family called Nemotron-Labs Diffusion on Hugging Face, showing off a blistering 865 tokens per second on its beastly B200 chip. The lineup includes 3B, 8B, and 14B parameter models, alongside an 8B multimodal version capable of processing images.

The secret sauce here is a technique called 'self-speculation'. Normally, speeding up text generation requires a clunky two-model tag team where a tiny, cheap model guesses the next words and a giant, smart model double-checks its homework. Nvidia realized this is like hiring an assistant just to watch them fail, so they turned the main model into a split personality that drafts a bunch of words in diffusion mode and proofreads its own work in regular autoregressive mode.
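The draft-then-proofread loop can be sketched in a few lines of Python. This is a generic greedy draft-and-verify loop, not Nvidia's implementation: `toy_next_token` and `toy_draft` are hypothetical stand-ins for the model's autoregressive mode and its diffusion-mode drafting, and the drafter is rigged to make a mistake every few tokens so that rejection actually happens.

```python
from typing import List, Tuple

VOCAB = 50  # hypothetical toy vocabulary size


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for the model in autoregressive mode (greedy)."""
    return (sum(context) * 31 + len(context) * 7) % VOCAB


def toy_draft(context: List[int], k: int) -> List[int]:
    """Hypothetical stand-in for diffusion-mode drafting of k tokens.

    It is deliberately wrong whenever the context length is 3 mod 4,
    so the verifier has something to reject.
    """
    ctx, out = list(context), []
    for _ in range(k):
        tok = toy_next_token(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # injected drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def autoregressive_decode(prompt: List[int], n: int) -> List[int]:
    """Baseline: one token per model pass."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]


def self_speculative_decode(prompt: List[int], n: int, k: int = 6) -> Tuple[List[int], int]:
    """Draft k tokens, keep the verified prefix, fix the first mismatch."""
    seq, target, passes = list(prompt), len(prompt) + n, 0
    while len(seq) < target:
        draft = toy_draft(seq, k)
        passes += 1  # one draft-and-verify round
        ctx, accepted = list(seq), []
        # A real system would score all k positions in one batched pass;
        # the toy checks them in a loop instead.
        for tok in draft:
            correct = toy_next_token(ctx)
            if tok != correct:
                accepted.append(correct)  # replace the first wrong token
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(toy_next_token(ctx))  # bonus token if all k pass
        seq.extend(accepted)
    return seq[len(prompt):target], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = autoregressive_decode(prompt, 40)
    fast, passes = self_speculative_decode(prompt, 40)
    print("same output:", fast == baseline, "| rounds:", passes, "vs 40")
```

Every token that ends up in the sequence is one the verifier itself would have picked, so the greedy output matches the plain one-token-at-a-time loop while needing fewer rounds.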

This method exploits how graphics cards work: when generating text one token at a time, a GPU typically spends most of its life waiting for memory to load rather than calculating things. By writing 5 to 7 tokens in a single pass instead of one by one, the chip gets to do some real work for once. Better yet, at zero temperature, the output is bit-for-bit identical to the slow method, giving users a completely free speed upgrade with zero compromises.
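A back-of-the-envelope cost model shows why extra tokens per pass are nearly free when memory is the bottleneck. This is a rough sketch under loud assumptions: every number below is an illustrative placeholder rather than a B200 spec or an Nvidia measurement, the "2 FLOPs per parameter per token" rule is a common approximation, and the model ignores attention, the KV cache, and rejected drafts.

```python
def pass_time_s(tokens_per_pass: int, weight_bytes: float, mem_bandwidth: float,
                flops_per_token: float, peak_flops: float) -> float:
    """Time for one forward pass: whichever of memory or compute is slower."""
    memory_time = weight_bytes / mem_bandwidth  # weights are read once per pass
    compute_time = tokens_per_pass * flops_per_token / peak_flops
    return max(memory_time, compute_time)


def tokens_per_second(tokens_per_pass: int, **hw: float) -> float:
    return tokens_per_pass / pass_time_s(tokens_per_pass, **hw)


# Illustrative placeholders only (assumptions, not vendor specs).
hw = dict(
    weight_bytes=8e9 * 2,     # 8B parameters at 2 bytes each
    mem_bandwidth=4e12,       # bytes per second
    flops_per_token=2 * 8e9,  # roughly 2 FLOPs per parameter per token
    peak_flops=1e15,          # FLOPs per second
)

if __name__ == "__main__":
    for k in (1, 6):
        print(k, "tokens/pass:", pass_time_s(k, **hw), "s per pass,",
              round(tokens_per_second(k, **hw)), "tokens/s")
```

With these placeholder numbers the pass takes the same time for one token as for six, because reading the weights dominates either way. Real-world gains are smaller, since not every drafted token survives verification.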

In head-to-head testing, the Nemotron-Labs Diffusion 8B squeezed out 1.2% higher accuracy than Qwen3 8B. On the SPEED-Bench test, it ran 2.4 times faster than Qwen3 paired with Eagle3, which was previously considered the gold standard of speed hacks. In complex coding and math, the self-speculating model accepted an average of 8.69 tokens per step, leaving its competitors in the dust.

This technique, based on the company's Efficient-DLM research, can theoretically be retrofitted onto other open-weight giants like Llama or DeepSeek. However, closed-source monopolies like OpenAI, Anthropic, or Google will have to manually rebuild their proprietary architectures to catch up.

While the trillion-dollar hardware giant continues to sell shovels in this AI gold rush, it is now casually rewriting the rules of the software game too. If open-source models can suddenly run four times faster for free, proprietary AI giants might have to explain why they are still charging premium subscription fees for sluggish chat interfaces.

Source: Hugging Face

Comments

  1. Glitchy Mongoose
    finally a speedup that doesn't involve lobotomizing the model to 2-bit quantization.
    +6 solid: Finally, speed without the lobotomy
  2. Drunk Rascal
    openai is finished lmao open source is moving way too fast
    +5 solid: OpenAI is sweating, and honestly, it is about time
  3. Grumpy Bishop
    great so now we can generate useless hallucinated code four times faster
    +4 solid: Now we can hallucinate four times faster than before