Journal Club · 5 min read

Four Conversations About Clinical AI That Quietly Agree

Four NEJM AI podcast interviews, recorded months apart, keep landing in the same three places: a values vacuum, a bias we taught the machine, and a trust gap that tracks consequence. None of it is evidence. The agreement is still worth an hour.

Dr. Sven Jungmann

CEO

Fifty physicians were asked to work through clinical cases. Half had their usual resources; half had those plus GPT-4. A separate arm gave the model the same cases with no physician attached. The model on its own scored roughly sixteen percentage points above the doctors using conventional tools, and handing the doctors that same model did not reliably lift them (Goh et al., JAMA Network Open 2024). The trial belongs to Jonathan Chen's group at Stanford, and it sits awkwardly against a principle he, like most informaticians, had treated as close to a law: that a clinician working with a computer beats either alone.

That principle has a name — the fundamental theorem of biomedical informatics, Charles Friedman's formulation that a person plus an information resource outperforms the person unassisted. Chen revisits it on the NEJM AI Grand Rounds podcast, and his is one of four interviews I want to read together here. The others are with the cognitive psychologist Laura Zwaan, the informatician Zak Kohane, and Seth Hain of Epic. None of this is a study. There is no shared dataset, no protocol, no peer review of anything said into the microphone. What earns the set an hour is that four people who build and study clinical AI, talking separately across four months, keep arriving at the same three places.

Nobody is checking whose values went in

Kohane puts the sharpest of the three on the table. Models already arrive with dispositions — one hangs back, another reaches for the aggressive workup — and a disposition is not a neutral setting; it is a clinical stance with consequences for the patient in front of it. His observation is that no regulator looks at where those stances come from. Medicines and device authorities ask whether a tool is safe and whether it works. Neither question travels as far as the values a system has absorbed, or whose values they were to begin with. For software that tips a diagnosis or a referral one way rather than another, that is not a minor gap in the oversight.

A bias we trained into it ourselves

Zwaan has spent her career on how clinicians err, and on a subtler problem in how the rest of us study those errors. Once you know how a case ended, you cannot un-know it; reasoning that looked sound at the time reads as negligent in the rear-view mirror. Hindsight bias is less a flaw in error research than a permanent tenant of it. The line to AI is short. A model trained on labelled outcomes is learning from notes written after the ending was known — so it inherits not just our knowledge but the specific way our judgement bends once the answer is in. We then ask it to catch the mistakes we make for exactly that reason.

We move fast where it is cheap to be wrong

The third thread is about pace. Administrative AI (coding, billing, the revenue cycle) has slipped into routine use across health systems with little friction. Clinical AI, the kind that reaches a diagnosis or a treatment, has not, and the speakers are clear that the brake is not chiefly a technical one. Hain, describing how Epic approaches this, treats slow and careful deployment for anything that touches the patient as a choice made on purpose rather than a failure of nerve. The asymmetry gives the game away: we hurry where a mistake costs money and we hesitate where it costs a person. That instinct is, broadly, correct. It also means the cases that matter most are the ones still queued.

What this is, and is not

Read the set for what it is. These are the considered views of people with deep stakes in the field — Chen and Hain build the tools, and candour is not the same as disinterest; an interview is not a controlled comparison. Not one of the three claims here would hold up if you cited it as evidence. The honest version is that they are hypotheses, sharpened by people who would know, that happen to point the same way: an oversight blind spot over values, a bias handed down from us, and a deployment gap that follows consequence rather than difficulty.

For European decision-makers the practical lesson is modest and real. What will decide whether clinical AI earns its place is not a benchmark score. It is whose values a system encodes, how its training data was labelled and by whom, and whether the caution we rightly keep at the bedside is matched by the scrutiny we apply before the tool ever arrives there. Four people who disagree about a great deal agree about that much. The agreement is worth taking seriously — and worth not mistaking for proof.

Source: NEJM AI Grand Rounds, interviews with Jonathan Chen (15 Oct 2025), Laura Zwaan (19 Nov 2025), Zak Kohane (17 Dec 2025) and Seth Hain of Epic (18 Feb 2026), hosted by Arjun Manrai and Andrew Beam. The diagnostic-reasoning result is from Goh et al., JAMA Network Open 2024. The interviews are recorded conversations, not peer-reviewed research: the views are individual, several speakers build the systems they discuss, and nothing said in them should be read as primary evidence.
