Xiaomi just hacked standard GPUs to run a massive 1T-parameter AI model at 1200 tokens/second!
While hardware giants beg you to buy custom multimillion-dollar chips just to get fast AI text, some absolute madmen in China just broke the speed barrier using the regular graphics cards already sitting in standard server racks.
Researchers from Xiaomi's MiMo and TileRT teams released a mind-melting update called UltraSpeed for their 1.02-trillion-parameter model, MiMo V2.5 Pro. They managed to squeeze out around 1200 tokens per second on a single standard 8-GPU server. To put this in perspective, this is the kind of blazing speed that usually requires custom, monstrously expensive wafer-scale hardware like Cerebras, but they did it on stock gear.
How they pulled this off is a masterclass in trickery rather than raw power. First, they target the massive MoE (Mixture of Experts) layers, which make up the bulk of the model’s weights, and aggressively compress them from 16-bit to 4-bit precision, shrinking those layers to roughly a quarter of their original size. Since MoE layers are surprisingly resilient to losing precision, the overall brain of the AI barely notices the lobotomy.
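To make the 16-bit to 4-bit step concrete, here is a minimal sketch in plain Python. The source does not say which quantization scheme UltraSpeed uses, so this assumes a common one (symmetric, per-group scaling); the names `quantize_int4`, `dequantize_int4`, and `weight_gigabytes` are hypothetical and chosen only for illustration.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Raw storage for the weights alone, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantization: signed 4-bit codes plus one scale per group.

    ASSUMPTION: illustrative scheme only; not confirmed as the method in the article.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto the code 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest code and clamp to the signed 4-bit range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4(codes, scales, group_size=4):
    """Recover approximate weights by multiplying each code by its group's scale."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


weights = [0.7, -0.35, 0.1, 0.0, 2.0, -1.0, 0.5, 0.25]
codes, scales = quantize_int4(weights)
restored = dequantize_int4(codes, scales)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, worst_error)

# Upper bound on the savings if every one of the 1.02T weights were quantized.
print(weight_gigabytes(1.02e12, 16), weight_gigabytes(1.02e12, 4))
```

The last line works out to about 2040 GB at 16-bit versus about 510 GB at 4-bit. That is only an upper bound on the savings, since the article says just the MoE layers are compressed, and real formats also store the per-group scales.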
Next comes the tag-team approach. Alongside the giant main model, they run a tiny, highly specialized draft model that acts as a fast-talking psychic. This little helper tries to guess eight tokens ahead all at once. The main model then reviews these guesses in one swift parallel sweep, instantly approving the correct ones. For coding tasks, the tiny sidekick gets more than six of its eight guesses right on the first try.
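The guess-then-verify loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not Xiaomi's implementation: it assumes greedy verification (the main model keeps the longest run of guesses that match what it would have produced, then supplies one token of its own), and `speculative_step`, `draft_next`, and `target_next` are hypothetical names standing in for the real models.

```python
def speculative_step(prefix, draft_next, target_next, k=8):
    """One draft-and-verify round; returns the tokens to append to prefix.

    ASSUMPTION: greedy acceptance. The article does not describe the exact rule.
    """
    # 1. The cheap draft model guesses k tokens, one after another.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The main model checks each guess. A real system scores all k positions
    #    in a single batched forward pass; this loop only mimics the outcome.
    accepted = []
    for guess in draft:
        expected = target_next(prefix + accepted)
        if guess != expected:
            # First wrong guess: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(guess)

    # 3. Every guess was right, so the same pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def target_next(seq):
    """Toy main model: count upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the main model except right after a 6."""
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next))  # [1, 2, 3, 4, 5, 6, 7]
```

In this toy run, six of the eight guesses are accepted and the main model adds a seventh token itself, so one pass of the big model produces seven tokens instead of one. The output is identical to what the main model would have generated alone; only the number of expensive passes changes.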
This speculative multi-token prediction trick isn't entirely unique; Google is playing with a similar concept in its Gemma 4. However, seeing it live is a different beast: a generation process that normally drags on for six grueling minutes on standard setups now finishes in a mere 12 seconds, roughly a 30x speedup.
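A quick back-of-the-envelope check of those numbers, as a hedged sketch. The speedup follows directly from the two durations quoted above; the token count and the implied baseline rate additionally assume that the 12-second run proceeds at the headline 1200 tokens per second and that both runs produce the same output, which the article does not state.

```python
baseline_seconds = 6 * 60   # "six grueling minutes"
fast_seconds = 12           # "a mere 12 seconds"

speedup = baseline_seconds / fast_seconds
print(speedup)  # 30.0

# ASSUMPTION: the fast run sustains the headline rate for its whole duration.
headline_tokens_per_second = 1200
tokens_generated = headline_tokens_per_second * fast_seconds
print(tokens_generated)  # 14400

# ASSUMPTION: the slow run emits the same tokens, just more slowly.
implied_baseline_rate = tokens_generated / baseline_seconds
print(implied_baseline_rate)  # 40.0
```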
This makes the frantic corporate rush to hoard proprietary, billion-dollar hardware look like a massive, hilarious overreaction. If software optimization can casually turn standard GPUs into speed demons, the stock prices of certain hardware monopolists might eventually face a very cold, sober reality check.
Source: MiMo