Journal Club · 5 min read

Two Readers, One Summary: Who Should Grade Patient-Facing AI?

A small Stanford study had clinicians and parents rate the same AI-written clinical summaries. They disagreed, significantly — and that disagreement, not the scores, is the finding worth keeping.

Dr. Sven Jungmann

CEO

The summaries were written at a Flesch–Kincaid grade level of 10.6 — roughly tenth-grade reading difficulty. The prompt had asked for sixth-to-eighth grade. That single gap, between what a language model was told to produce for families and what it actually produced, is a small window onto a larger question this study is unusually honest about: who is the right judge of a clinical text meant for patients?
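For readers who want to see what a grade level of 10.6 is made of, here is a minimal sketch of the standard Flesch–Kincaid grade formula. The syllable counter is a crude vowel-group heuristic chosen for illustration, and the function names are hypothetical; the study does not say which tool computed its readability figure, so this will not reproduce the paper's number exactly.

```python
import re


def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (standard published formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Assumption: approximate syllables as runs of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade_text(text: str) -> float:
    """Estimate the grade level of a plain-text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)
```

The formula rewards short sentences and short words, so a summary can be clinically simple and still score high if it keeps long drug or procedure names.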

The study, published in JMIR AI on 10 February 2026 by a group from the Lucile Packard Children's Hospital at Stanford, is small and clear about its scope. The team took the assessment-and-plan sections of two consecutive daily progress notes for each of 50 children on a paediatric cardiovascular intensive care unit and summarised them with GPT-4o mini. Eight paediatric cardiologists and ten parents of children with heart disease then rated those summaries — two clinicians and two parents per summary, on different rubrics, for different things. The interesting move is not the summarisation. It is the decision to ask two kinds of reader the same underlying question and watch them part ways.

Where the readers diverge

Asked how helpful the summaries would be for families, parents answered warmly: three helpfulness items scored between 3.25 and 3.36 on a four-point scale. Clinicians, judging that same dimension (would this help a family?), landed at 2.97. A Mann–Whitney U test indicates the gap is unlikely to be noise (U = 3897; z = 2.69; P = .007). It is a modest distance in absolute terms and a clear one statistically: the people the tool is for and the people who would sign off on it do not rate its usefulness the same way.
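The reported z and P can be checked against each other. The sketch below assumes a two-sided test under the normal approximation; the paper may have used an exact or continuity-corrected procedure, so this is a consistency check on the reported figures and not a reconstruction of the authors' analysis. The function name is hypothetical.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal z statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


p = two_sided_p_from_z(2.69)  # z as reported in the study
print(round(p, 3))
```

At three decimals the result agrees with the published P = .007.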

On their own clinical yardstick the cardiologists were measured. Accuracy came in at 3.19, completeness at 3.04, no-revision-needed at 2.96, clinical alignment at 2.90: competent, not glowing. Agreement within each group was only moderate: a Krippendorff's α (a measure of inter-rater reliability, where 1 is perfect agreement and 0 is what chance would produce) of 0.69 among clinicians and 0.75 among parents. So the disagreement is not merely clinicians-versus-parents; there is real spread inside each camp too.
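To make the α figures less abstract, here is a minimal sketch of Krippendorff's α built from a coincidence matrix. It assumes an interval (squared-difference) distance between ratings; the study's rating scales are ordinal, and the article does not say which metric the authors used, so treat the metric as an assumption. The ratings in the usage lines are made up for illustration and are not study data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, distance=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha for a list of units, each a list of ratings.

    Assumption: interval metric (squared difference) by default.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit rated once carries no agreement information
        # Every ordered pair of ratings within a unit counts 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w * distance(a, b) for (a, b), w in coincidences.items())
    expected = sum(
        totals[a] * totals[b] * distance(a, b) for a in totals for b in totals
    )
    if expected == 0:
        return float("nan")  # no variation in the ratings: alpha is undefined
    return 1 - (n - 1) * observed / expected


# Hypothetical example: three summaries, two raters each.
toy_units = [[1, 1], [2, 2], [1, 2]]
alpha = krippendorff_alpha(toy_units)
```

Two agreeing pairs and one disagreeing pair give an α of 4/9 here, which shows how quickly a single split rating pulls the statistic down when there are few units.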

The finding is methodological, not a score

Read carelessly, the headline is a polite tie: parents liked it a bit more, clinicians a bit less. Read carefully, the contribution is elsewhere. Patient-facing AI is almost always evaluated the way clinical decision-support is — by clinical experts, against clinical criteria such as accuracy, completeness and fidelity to the source. For software that informs a clinical decision, that is exactly the right test. For software whose job is to help a worried relative follow what is happening, it measures the wrong thing well.

A rubric that only asks whether a summary is clinically accurate cannot tell you whether the mother at the bedside understood it.

Whether a text is clinically precise tells you nothing about whether the mother at the bedside understood it, whether it helped her shape the next conversation with the team, or whether it sharpened her worry instead of ordering it. Those are separate questions, and they need separate judges. That a clinical rubric does not capture them is not a defect in the rubric; it is its boundary. What the study does well is make that boundary visible, and it returns to the readability number to do so: the warm parent ratings (readability scored 3.36) sit awkwardly next to a tenth-grade reading level, and probably reflect a health-literate volunteer group rather than parents in general.

What it does not support

The authors name their limits plainly, and they bind. This is a single centre, a highly specialised US institution and a narrow subspecialty, which constrains generalisability; the model overshot the readability it was told to hit; the raters only moderately agreed. The sharpest caveat is the easiest to overlook: parents rated summaries of other children's notes, not their own. That deliberately removes the emotional stake — a parent reading about their own critically ill child — that the tool would carry in real use. The reading captured here is calmer than the one that matters, which means even the parent scores may understate the stakes rather than overstate the benefit.

None of this dents the central claim, because the claim is narrow and well-chosen: judging a patient-communication tool by clinical fidelity alone risks the wrong verdict in either direction. A tool that scores middling with clinicians yet genuinely serves families better than what they had will be undervalued by a purely clinical assessment — and a tool that reads cleanly to experts but confuses families will be overvalued by it. This is a pilot raising a structural question in a thinly studied area, not a study that settles it.

The relevance to European hospitals investing in AI-assisted documentation is direct. The criteria that decide procurement and rollout are almost always clinical and administrative — efficiency, documentation quality, interoperability, data protection. Whether patients and relatives actually understand the generated text, find it useful, and can carry it into their own conversations with the team is rarely measured at all. The quiet argument of this paper is simply that someone should.

Source: Han B, Barnes T, Reddy CD, Shin AY. Evaluating Large Language Model–Generated Clinical Summaries Through a Dual-Perspective Framework: Retrospective Observational Study. JMIR AI 2026;5:e85221. A single-centre retrospective pilot with no funding and no declared conflicts of interest; it raises a methodological question rather than settling it.
