Journal Club26 May 20264 min read

Physicians Ranked AI Answers Above Their Colleagues'. Read What They Were Comparing.

In 118 of 150 blind comparisons, physicians preferred GPT-4.0's answer to a doctor's. The figure is real. But the doctors' answers came from a Reddit forum, and no one ever checked which answer was correct.

Dr. Sven Jungmann

CEO

Editorial collage of three stacked answer-cards ranked by a teal bracket, two neat and one dog-eared and handwritten, with a faint forum thread behind and a single amber accent on the human card.

Two numbers should be held side by side. In 118 of 150 blind head-to-head comparisons — 78 percent — physicians preferred a GPT-4.0 answer to a doctor's. And in exactly zero of those comparisons did anyone check whether the preferred answer was correct. The first number is the one that travels; the second is the one that tells you how to read the first. A journal club exists to keep the two in the same sentence.

The study sits in JMIR Formative Research — and the journal's name is part of the appraisal. Formative Research is where early, hypothesis-generating work is meant to live, and the authors describe their own design as exploratory and descriptive rather than inferential. Fifty-two licensed physicians, recruited internationally through professional networks and email lists between March and May 2025, each saw three anonymised answers to a real patient question: one from GPT-4.0, one from Meta AI, one written by a verified doctor. They ranked the three for accuracy and responsiveness, best to worst, without knowing which was which. Pooled across raters, GPT-4.0 finished first (mean rank 1.63), Meta AI second (1.83), and the human doctors last (2.53). That ordering is reported correctly. The question is what produced it.

What the human answers actually were

Everything turns on the comparator. The physician answers were not letters, consultations, or discharge summaries. They were posts to Reddit's r/AskDocs forum, written by credentialed volunteers for anonymous strangers — without the patient's record, history, or examination, often quickly, with no duty of follow-up. A language model, meanwhile, returns a structured, complete, evenly-toned paragraph on every attempt. It never writes between two ward rounds and never has a bad day. So the contest was never AI against medicine. It was a polished machine paragraph against the most informal register a doctor ever writes in.

The authors flag this themselves. They note that the Reddit format "may not fully reflect the complexity of clinical communication in real-world practice," and that the construct of "preference" was nonspecific enough to mean different things to different raters. A preference scored on text alone rewards precisely what a fluent model is engineered to produce — coherence, completeness, an unhurried tone. None of those is a measure of whether the advice was right.

The part that is solid

Read at the strength the data allow, there is a real finding here. In a small, self-selected international sample, doctors reading blind consistently preferred well-formed model answers to anonymous forum answers — and the consistency is what makes it interesting. It held across the regions with enough respondents to say anything (North America, Africa, Asia) and across every experience band; even among the nine physicians with more than fifteen years in practice, AI answers outranked the human ones (1.75 versus 2.62). The preference is not the quirk of one cohort. On the narrow ground of structured written communication, it holds.

And the part that does not follow

It does not follow that the model was more accurate, because accuracy was never independently assessed. A confident, well-organised wrong answer scores well on a rubric that asks only which text the reader prefers, and this design has no way to catch one. It does not follow that AI communicates better than a physician who holds the chart, the examination, and responsibility for the outcome — that comparison was never run. The numbers are descriptive only: with 52 participants the authors performed no formal significance tests, so the data bear confidence intervals, not p-values. And the subgroups are thin to the point of being decorative — Europe, for instance, is represented by a single respondent. One person is not a region.

“A confident, well-organised wrong answer scores well on a rubric that asks only which text you prefer. This design has no way to catch one.”

Where it does point

Read for what it is, the study aims somewhere genuinely useful — not at the bedside, but at the search bar. Patients read health information online constantly, and against the scattered, informal advice they would otherwise find there, a model's legible, complete answer may well be the better text. That is a fair observation about digital health communication and the standards we ought to hold it to. It is not a statement about clinical care, and treating a preference for tidy prose as a verdict on medical competence is exactly the misreading the headline invites. The more interesting lesson is the narrower one: the machine has not out-doctored anyone. It has made the cost of writing badly impossible to ignore — and when doctors write as carefully as it does, the gap may simply close.

Source: Brooks JS, Blankson P-K, Campbell PM, et al. Assessment of Physician Preferences for Large Language Model-Generated Responses Across Geographic Regions and Clinical Experience Levels: Preliminary Survey Study. JMIR Formative Research 2026;10:e82487. A preliminary, exploratory online survey of 52 physicians with no formal significance testing, in which the human comparator was anonymous Reddit forum answers and the accuracy of any answer was never independently assessed.

#Journal Club#Clinical AI#Large Language Models#Health Communication#Evidence-Based Medicine

Physicians Ranked AI Answers Above Their Colleagues'. Read What They Were Comparing.

What the human answers actually were

The part that is solid

And the part that does not follow

Where it does point

Keep reading

Why aiomics for QM reports and quality analytics

The 4 p.m. Hazard: When Bad Software Becomes a Clinical Risk

The Value of AI Isn't Prediction. It's Cognitive Ergonomics.

This analysis comes from the people behind Visite.

Want to see this in your hospital?