Journal Club · 4 min read

When AI Joined the Cardiology Clinic: What the Trial Actually Showed

A genuine randomized trial put a medical language model beside nine cardiologists on 107 complex cases. The result is real — and narrower than the headline. It measured preference, not outcomes, and the system was its makers' own.

Dr. Sven Jungmann

CEO

Most of what we are shown about medical AI is a benchmark: a model answering exam questions, or beating a test set it will never meet again. This study is different, and that is why it is worth an hour. Nine general cardiologists sat with 107 genuinely complex patients — people referred for suspected genetic cardiomyopathy — and managed them with and without an AI assistant, while three subspecialists who did not know which was which graded the work. It is, as clinical evaluations of language models go, about as honest a design as the field has produced.

The clinical problem it targets is real. Subspecialist expertise in inherited heart disease is concentrated in a handful of centres; in much of the United States there is none within reach, and most affected people are never diagnosed. Hypertrophic cardiomyopathy is also the commonest cause of sudden cardiac death in the young — a largely preventable death, when the diagnosis is made in time. The question the trial asks is not whether AI can dazzle, but whether it can lift an ordinary cardiologist's work closer to a specialist's.

What they actually did

The system is AMIE (Articulate Medical Intelligence Explorer), a research assistant built on Google's Gemini 2.0 Flash. Each patient was worked up by two of the nine cardiologists — one randomized to use AMIE, one not — both with access to the same multimodal record: ECGs, Holter monitoring, resting and stress echocardiograms, cardiac MRI reports, exercise testing. Genetic results were withheld from everyone, AMIE included. Three blinded Stanford subspecialists then scored every assessment against a ten-domain rubric covering triage, diagnosis and management. This is a randomized controlled trial — though one run on retrospective case data, not on patients followed forward in time. That distinction turns out to be the whole story.

What the evidence supports

The assisted cardiologists did measurably better on the things the AI is good at. Overall, the blinded subspecialists preferred the AMIE-assisted assessment 46.7 percent of the time against 32.7 percent for the cardiologist alone, with the rest judged a tie (P = 0.02). Clinically significant errors fell from 24.3 to 13.1 percent of cases (P = 0.033). Missing clinically relevant content — the omission that quietly harms — fell from 37.4 to 17.8 percent (P = 0.0021). The cardiologists themselves felt helped in 57 percent of cases and felt time was saved in half.
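
The percentages above are easier to weigh as case counts and as absolute versus relative differences. What follows is a minimal sketch of that arithmetic, not the authors' analysis. It assumes every percentage is a share of all 107 cases, and the helper names `implied_count` and `reduction` are hypothetical, introduced only for illustration. It does not recompute the P values, because the statistical test behind them is not described here.

```python
# Percentages as quoted in this article. Assumption: each one is a share of all 107 cases.
N_CASES = 107


def implied_count(percent: float, n: int = N_CASES) -> int:
    """Nearest whole number of cases consistent with a reported percentage."""
    return round(percent / 100 * n)


def reduction(unassisted: float, assisted: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = unassisted - assisted
    relative = absolute / unassisted * 100
    return round(absolute, 1), round(relative, 1)


# Preference split: assisted preferred, unassisted preferred, and the remaining ties.
preferred_assisted, preferred_alone = 46.7, 32.7
tie_share = round(100 - preferred_assisted - preferred_alone, 1)

print("Preference counts:", implied_count(preferred_assisted),
      implied_count(preferred_alone), implied_count(tie_share))
print("Clinically significant errors:", reduction(24.3, 13.1))
print("Missing relevant content:", reduction(37.4, 17.8))
```

Under that assumption the preference split works out to roughly 50, 35 and 22 cases. Errors fall by about 11 percentage points (a relative drop near 46 percent), and omissions by about 20 points (near 52 percent). The absolute figures are the ones to carry into any discussion of clinical impact. The relative ones flatter a sample this small.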

Read the domain breakdown, though, and the effect is precise rather than broad. The advantage sat in the management plan and in not leaving things out. On triage, on the diagnosis itself, and on framing the next diagnostic question — the domains that demand the sharpest judgement — there was no significant difference. The model helped most where the task was synthesis: pulling six investigations into a coherent plan and catching what a busy generalist forgets. It did not make anyone a better diagnostician.

What it does not support

Here the careful reader has to slow down. The endpoint was preference, meaning what an expert thought of a written assessment, not a single patient outcome. Nobody followed these 107 people forward to see whether the assisted plans led to earlier diagnoses, fewer sudden deaths, or fewer needless tests. A more sharply reasoned, more complete note plausibly leads to better care; it is not the same thing as better care, and the trial cannot tell them apart.

Two further limits matter. The cases came from a single US centre, in English only, and the cardiologists were not blinded to whether they were using the tool. And the system being tested was built by Google, several of whose researchers co-authored the study and helped design the very rubric on which AMIE was judged. None of this is hidden — the authors are unusually candid — but a positive result on your own product, scored on your own scale, is the kind of finding that earns its weight only after someone else reproduces it.

The authors put their own verdict plainly: “It seems premature to deploy LLMs autonomously.”

Why it matters here

For European systems the structural problem is familiar: deep expertise pooled in a few academic centres, long waits, and geography deciding who gets a specialist opinion. If assistance of this kind can raise a general cardiologist's management to something nearer subspecialist quality, that is worth taking seriously — under supervision, with trained users, and with the regulatory care any clinical software demands. The useful question is no longer whether such tools will enter the clinic. It is the narrower, harder one the trial leaves open: under what conditions, with what oversight, and measured against which outcomes that actually reach the patient.

Source: O'Sullivan JW, Palepu A, Saab K, et al. A large language model for complex cardiology care. Nature Medicine 2026;32(2):616–623. A randomized controlled trial on retrospective cases, co-authored by the system's developers; its primary endpoint was expert preference, not patient outcomes.

Tags: Journal Club, Clinical AI, Cardiology, Evidence-Based Medicine, Large Language Models

