GPT-5.5 Crushes Claude Opus 4.8: Coding Bots Are Actually Getting Smarter

OpenAI just flexed on Anthropic by making its latest model cheaper and faster at fixing code. It turns out Claude Opus 4.8 is like that over-engineered printer that burns through ink just to print one page: impressive, but painful for the wallet.

The latest SWE-rebench leaderboard for May is out, and it’s a bloodbath for efficiency. OpenAI’s GPT-5.5 medium is currently sitting at the top, managing to solve 58.9% of real-world GitHub issues while burning through only 0.71 million tokens. Meanwhile, Anthropic’s Claude Opus 4.8 is stuck playing catch-up, spending nearly a million tokens to solve fewer tasks, despite similar pricing.
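For the arithmetic-minded, here is a rough sketch of what that efficiency gap means per solved issue. It assumes the 0.71 million figure is an average token count per attempted task, which the leaderboard summary above does not spell out. The function name `tokens_per_solved` is made up for illustration. Only GPT-5.5 medium is computed, since those are the only exact numbers quoted.

```python
def tokens_per_solved(avg_tokens_millions: float, resolved_rate: float) -> float:
    """Expected tokens (in millions) spent per successfully resolved issue.

    Assumption: avg_tokens_millions is the mean spend per attempted task,
    so failed attempts are paid for by the successful ones.
    """
    if not 0 < resolved_rate <= 1:
        raise ValueError("resolved_rate must be in (0, 1]")
    return avg_tokens_millions / resolved_rate


# Figures quoted in the article for GPT-5.5 medium.
gpt55_medium = tokens_per_solved(0.71, 0.589)
print(f"{gpt55_medium:.2f}M tokens per solved issue")  # prints "1.21M tokens per solved issue"
```

Under the same assumption, a model that spends more tokens per task and solves fewer of them loses twice on this ratio, which is the whole joke of the printer comparison.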

The real kicker is the predictability gap. While Anthropic managed to slim down their model from the previous version, they mostly just traded expensive computing power for a slightly better price tag. GPT-5.5, on the other hand, is leaning into the pass^5 metric, meaning it’s not just getting lucky once; it’s actually delivering reproducible code that doesn't fall apart the moment you run it twice.
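To make the "lucky once" versus "works every time" distinction concrete, here is a minimal sketch. It assumes pass^5 counts a task as solved only when all five independent runs succeed, while the more forgiving pass@5 counts it if any run succeeds. The run data below is invented purely for illustration and is not taken from the leaderboard.

```python
def pass_at_k(runs: list[list[bool]]) -> float:
    """Share of tasks solved in at least one of the k runs (the 'got lucky' metric)."""
    return sum(any(task_runs) for task_runs in runs) / len(runs)


def pass_hat_k(runs: list[list[bool]]) -> float:
    """Share of tasks solved in every one of the k runs (the reproducibility metric)."""
    return sum(all(task_runs) for task_runs in runs) / len(runs)


# Hypothetical outcomes: four tasks, five runs each.
runs = [
    [True, True, True, True, True],       # solved every time
    [True, False, True, True, True],      # flaky: one failure
    [False, False, False, False, False],  # never solved
    [False, False, True, False, False],   # lucky once
]

print(pass_at_k(runs))   # 0.75
print(pass_hat_k(runs))  # 0.25
```

The same model can look great on pass@5 and mediocre on pass^5, which is why a reliability-focused score is harder to game with a single good sample.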

The leaderboard hierarchy is shifting, with GPT-5.5 in xhigh mode taking the gold at 62.7% success. Coding agents like Codex and Claude Code are riding the coattails of these models to stay relevant, while the Cursor agent is proving that being cheap—just 23 cents per task—might actually be a better business strategy than raw power. Even the open-source challenger GLM 5.1 is breathing down their necks, making the flagship models look a bit like bloated software suites from the early 2000s.

At this rate, the race to reach AGI is going to be won by whoever manages to waste the fewest tokens rather than whoever manages to sound the most poetic. It is truly peak Silicon Valley that we are celebrating models for being less incompetent and cheaper, as if 'not setting your server money on fire' is the new technological pinnacle.

Source: SWE-rebench

Comments

Sleepless Goblin

another day, another model bragging about tokens while my real world code still breaks. wake me up when it actually finishes a project.

+4 solid: The harsh reality check that no amount of token-counting will fix your spaghetti code

Rusty Jester

lol so they made it cheaper by making it dumber? classic corp pivot.

+3 funny: The corporate playbook: strip away the intelligence, keep the marketing, and call it an upgrade