Journal Club · 5 July 2026 · 4 min read

More AI Health Tools Than Ever. The Question Is Whether They Help.

Accuracy on a benchmark is not the same as helping a patient, and almost no one has measured the second thing. A careful piece of reporting holds that line — and the two trials it leans on show why it matters.

Dr. Sven Jungmann

CEO

The most uncomfortable number in this debate is not about the models. It is that people handed a language model to work out what is wrong with them did worse than people given an ordinary search and a phone — and far worse than the same model left to answer on its own. The tools are multiplying faster than anyone can test them, and where they have been tested properly, the result points the wrong way.

That is the gap Grace Huckins maps in MIT Technology Review, reporting that Microsoft, Amazon, OpenAI and Google have all shipped consumer health assistants — Microsoft alone fielding some 50 million health questions a day — while almost none of them have been evaluated by anyone other than the company selling them. This is a feature, not a study: read it as a well-sourced account of an evidence vacuum, with the actual weight resting on two papers it points to.

The number the leaderboards never show

The first is a preregistered randomized trial from the Oxford Internet Institute, published in Nature Medicine, that did the thing a benchmark cannot: it put real people in the loop. Some 1,298 participants worked through ten clinical scenarios, one group using a language model, a control group using whatever sources they would normally reach for. Prompted directly, the models were excellent: they named the relevant condition in roughly 95 percent of cases (GPT-4o 94.7 percent, with Llama 3 and Command R+ alongside). Put in human hands, the same models led participants to the right answer in under 34.5 percent of cases. The decisive comparison is the one that is easy to miss: the people using AI did not merely fall short of the model, they did worse than the control group that had no AI at all.
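The three-way comparison can be made concrete with a short sketch. It assumes a hypothetical record format (one row per scenario attempt, with an arm label and a correctness flag); the arm names and the toy rows are illustrative and are not the trial's data.

```python
from collections import defaultdict

def accuracy_by_arm(records):
    """records: list of (arm, correct) tuples. Returns {arm: share correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for arm, correct in records:
        totals[arm] += 1
        hits[arm] += bool(correct)
    return {arm: hits[arm] / totals[arm] for arm in totals}

# Illustrative rows only; arm names are assumptions, not the paper's labels.
toy_records = [
    ("model_alone", True), ("model_alone", True),
    ("model_alone", True), ("model_alone", False),
    ("human_with_model", True), ("human_with_model", False),
    ("human_with_model", False), ("human_with_model", False),
    ("control", True), ("control", True),
    ("control", False), ("control", False),
]

acc = accuracy_by_arm(toy_records)

# The benchmark-style number: what the model scores when prompted directly.
gap_vs_model = acc["model_alone"] - acc["human_with_model"]

# The number that reaches the patient: users with the model versus users without it.
gap_vs_control = acc["human_with_model"] - acc["control"]
```

A negative `gap_vs_control` is the pattern the trial reports: the arm with the model ends up below the arm without it, whatever the model scores on its own.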

The reason is undramatic, which is precisely why it resists a software fix. Users leave out the detail the model needs; the model returns a confident blend of right and wrong that a frightened non-expert cannot separate; the exchange runs over several turns, while the evaluations vendors publish grade a single tidy reply. A leaderboard measures the model. A patient experiences the conversation.

Calm, plausible, and wrong at the worst moment

The second paper, from Mount Sinai and also in Nature Medicine, asks a sharper safety question of one consumer system: when it matters most, does it send you to the right place? Across 60 scenarios spanning 21 specialties and 960 interactions, graded against physician consensus, the tool was steady in the middle of the severity range and unreliable at the edges — over-triaging 35 percent of non-urgent presentations and, more worryingly, under-triaging 48 percent of true emergencies. It handled the unmistakable cases, a stroke or anaphylaxis, but reassured users in the ambiguous ones even after surfacing dangerous symptoms. A tool that is wrong and obviously useless is a nuisance; one that is calm, fluent and wrong about a real emergency is a hazard, because the calm is what gets believed. The interface says it is 'not intended for diagnosis or treatment'; the reporting is candid that people use it for exactly that, at 2 a.m., when no clinic is open.
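The two error rates above are conditional rates, and a minimal sketch shows how such figures could be computed from graded interactions. The four-level urgency scale, the level names and the toy pairs are assumptions for illustration; the paper's actual categories and data may differ.

```python
# Hypothetical ordinal urgency scale (an assumption, not the paper's scheme).
LEVELS = {"self_care": 0, "routine": 1, "urgent": 2, "emergency": 3}

def triage_error_rates(pairs, low="self_care", high="emergency"):
    """pairs: list of (reference_level, tool_level) strings.

    Over-triage is counted among lowest-urgency references only;
    under-triage is counted among emergency references only.
    """
    low_cases = [(ref, tool) for ref, tool in pairs if ref == low]
    high_cases = [(ref, tool) for ref, tool in pairs if ref == high]
    over = sum(LEVELS[tool] > LEVELS[ref] for ref, tool in low_cases)
    under = sum(LEVELS[tool] < LEVELS[ref] for ref, tool in high_cases)
    return {
        "over_triage": over / len(low_cases) if low_cases else None,
        "under_triage": under / len(high_cases) if high_cases else None,
    }

# Illustrative graded interactions: (physician consensus, tool recommendation).
toy_pairs = [
    ("self_care", "self_care"),
    ("self_care", "urgent"),       # over-triage
    ("routine", "routine"),
    ("urgent", "urgent"),
    ("emergency", "emergency"),
    ("emergency", "routine"),      # under-triage
    ("emergency", "urgent"),       # under-triage
    ("emergency", "emergency"),
]

rates = triage_error_rates(toy_pairs)
```

Reporting the two rates separately matters because an overall accuracy figure would average a tolerable error (sending a mild case to a clinic) with a dangerous one (keeping an emergency at home).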

What the article can and cannot carry

As journalism, the feature proves nothing by itself; it borrows its authority from the trials and the researchers it quotes, and it is a snapshot of a moving target. The named models will be superseded, and progress is not even monotonic — the article notes that OpenAI's newer flagship is worse at asking for context than an earlier version, a quiet warning against assuming the next release fixes this. The defensible conclusion stays narrow. Where the right question has been asked, the distance between benchmark accuracy and patient benefit is large and pointing the wrong way; for most tools, the question has not been asked at all.

For anyone procuring or governing clinical AI, the lesson is not that AI is unsafe. It is about which evidence counts: the tool in real hands, judged by what happens to the patient, not the score on a test set. That evidence remains rare enough that its absence should be the default assumption until a study shows otherwise. Accuracy buys a seat at the table. It is not the proof that anyone was helped.

“A model can be right on the test set and still leave its user worse off than no AI at all. Only the second number ever reaches the patient.”

Source: Huckins G. There are more AI health tools than ever—but how well do they work? MIT Technology Review, 30 March 2026. This is secondary reporting, not primary research; its central claims rest on two peer-reviewed Nature Medicine papers — a preregistered randomized Oxford trial and a Mount Sinai triage evaluation — which carry the evidentiary weight here.

#Journal Club #Clinical AI #Health Policy #Evidence-Based Medicine #Patient Safety
