Why Anthropic's smart AI agents code like gods but fail biology exams

While Silicon Valley screams about AI replacing everyone, it turns out our digital overlords are completely powerless against the true final boss: terrible web design. A new study shows why bots are great at coding but fail biology.

Software developers built their world to be automated, but biologists apparently still prefer the digital equivalent of dial-up internet. In a recent analysis, the team at Anthropic compared AI navigation to driving a modern sports car through a narrow, medieval town built way before anyone even thought of roads.

While coding enjoys standardized APIs, package managers, and version control, bioinformatics remains a chaotic patchwork of custom scripts and ancient databases. AI agents struggle in the scientific field not because they lack reasoning skills, but because they lack clean, machine-friendly ways to access biological data.

This mess becomes a matter of life and death during real crises. Right now, the Democratic Republic of Congo is fighting a deadly Ebola outbreak. To track mutations, scientists must compare the virus genome with older samples. But to get this data from the public NCBI Virus database, researchers—and AI agents—have to manually click through a clunky web interface that looks like a high school computer science project from the nineties.

Even tech legends are feeling this pain. Former Tesla AI director Andrej Karpathy recently complained about the "click tax" of modern web development, noting that writing the actual code of a simple app took him a few hours, while configuring authentication and deployment via browser dashboards took a whole week of mindless clicking.

To measure the scale of this scientific tragedy, researchers created a benchmark called VirBench, throwing models like Claude Sonnet 4 and GPT-5.5 into the biological wild. The results were hilariously unstable: overall accuracy swung wildly between 16.9% and 91.3%. On a simple request for Ebola data, Sonnet 4 went completely rogue, returning 106 sequences on the first run, 15 on the second, and just 5 on the third, instead of the actual 266.
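To put that run-to-run swing in numbers, here is a minimal sketch that uses only the counts quoted above. It assumes every returned sequence was a correct one, which the article does not confirm, and the "recovered fraction" metric is our own illustration, not necessarily how VirBench scores accuracy.

```python
from statistics import mean, pstdev

EXPECTED = 266       # record count the article gives as the correct answer
RUNS = [106, 15, 5]  # sequences Sonnet 4 returned on three runs, per the article


def recovered_fraction(returned: int, expected: int = EXPECTED) -> float:
    """Share of the expected records a run returned, capped at 1.0.

    Assumption: every returned record is a correct one.
    """
    return min(returned, expected) / expected


fractions = [recovered_fraction(n) for n in RUNS]

for i, frac in enumerate(fractions, start=1):
    print(f"run {i}: {frac:.1%} of expected records")

# Population standard deviation as a rough measure of run-to-run spread.
print(f"mean {mean(fractions):.1%}, spread {pstdev(fractions):.1%}")
```

Under that assumption, the three runs recover roughly 40%, 6%, and 2% of the records, so the same prompt produced three very different answers.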

The savior turned out to be a simple, deterministic tool called gget virus, which bypasses the messy web interfaces and aggregates various database APIs automatically. With this tool in hand, accuracy shot past 90% for all models, with GPT-5.5 hitting a near-perfect 99.7%, and the mood swings between runs disappeared.
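Why does a deterministic tool kill the mood swings? A rough sketch of the idea, and explicitly not how gget virus itself works: when a query is built programmatically, the same inputs always produce the same request. The example below uses NCBI's public E-utilities `esearch` endpoint, which is a separate interface from the NCBI Virus portal discussed here. The helper name `build_count_query` and the example search term are our own assumptions.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (a different interface from the NCBI Virus portal).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_query(organism: str, db: str = "nuccore") -> str:
    """Build a URL asking E-utilities how many records match an organism.

    Hypothetical helper for illustration; it only builds the URL and
    does not send any request.
    """
    params = {
        "db": db,                         # nucleotide database
        "term": f"{organism}[Organism]",  # Entrez field-tagged search term
        "rettype": "count",               # ask only for the hit count
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


# The example organism term is an assumption; adjust it to the taxon of interest.
print(build_count_query("Ebolavirus"))
```

No clicking, no dropdowns: a script or an agent that calls this twice gets the same request twice. Any count it returns would come from a different database than the one in the study, so it should not be expected to match the 266 figure.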

The upshot: a cheap, basic AI model equipped with proper tools can outperform an expensive, bloated flagship model that is stuck wrestling with a manual search interface.

The hard truth is that the bottleneck of future scientific discovery isn't the intelligence of neural networks, but the laziness of human database designers. Building smarter AI is useless if the machines still have to waste compute power navigating broken dropdown menus and ancient web portals.

Source: Anthropic
