Journal Club · 5 min read

Two Readers, One Summary: Who Should Grade Patient-Facing AI?

A small Stanford study had clinicians and parents rate the same AI-written clinical summaries. They disagreed, significantly — and that disagreement, not the scores, is the finding worth keeping.

Dr. Sven Jungmann

CEO


The summaries were written at a Flesch–Kincaid grade level of 10.6 — roughly tenth-grade reading difficulty. The prompt had asked for sixth-to-eighth grade. That single gap, between what a language model was told to produce for families and what it actually produced, is a small window onto a larger question this study is unusually honest about: who is the right judge of a clinical text meant for patients?
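For readers who want to see what a grade level of 10.6 is made of, here is a minimal sketch of the standard Flesch–Kincaid grade formula. The function names and the vowel-group syllable counter are illustrative assumptions, not the tooling the study used; real readability tools count syllables more carefully, so scores from this sketch will differ somewhat.

```python
import re


def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (standard published formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def rough_counts(text: str) -> tuple[int, int, int]:
    """Very rough word, sentence and syllable counts (hypothetical helper).

    Assumption: a syllable is approximated as one run of vowels per word,
    with a minimum of one syllable per word.
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    tokens = re.findall(r"[A-Za-z]+", text)
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", token.lower()))) for token in tokens
    )
    return len(tokens), sentences, syllables


if __name__ == "__main__":
    # Longer sentences and longer words both push the grade level up.
    print(round(fk_grade(words=100, sentences=5, syllables=150), 2))
```

The formula rewards short sentences and short words and nothing else, which is why a summary can be clinically faithful and still land well above the grade level a prompt asked for.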

The study, published in JMIR AI on 10 February 2026 by a group from the Lucile Packard Children's Hospital at Stanford, is small and clear about its scope. The team took the assessment-and-plan sections of two consecutive daily progress notes for each of 50 children on a paediatric cardiovascular intensive care unit and summarised them with GPT-4o mini. Eight paediatric cardiologists and ten parents of children with heart disease then rated those summaries — two clinicians and two parents per summary, on different rubrics, for different things. The interesting move is not the summarisation. It is the decision to ask two kinds of reader the same underlying question and watch them part ways.

Where the readers diverge

Asked how helpful the summaries would be for families, parents answered warmly: three helpfulness items scored between 3.25 and 3.36 on a four-point scale. Clinicians, judging the same dimension (would this help a family?), landed at 2.97. A Mann–Whitney U test indicates the gap is unlikely to be chance (U = 3897; z = 2.69; P = .007). It is a modest distance in absolute terms and a clear one statistically: the people the tool is for and the people who would sign off on it do not rate its usefulness the same way.
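The reported z and P can be checked against each other with a few lines of arithmetic. This is a sketch that assumes the P value is two-sided and derived from the normal approximation to the Mann–Whitney U statistic; the function name is illustrative.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal z score.

    Uses the identity P = erfc(|z| / sqrt(2)), which equals 2 * (1 - Phi(|z|)).
    """
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # z = 2.69 as reported in the study; prints a value of about 0.007
    print(round(two_sided_p(2.69), 3))
```

Under that assumption the two reported numbers are consistent with each other, which is a small but useful sanity check when reading a results line.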

On their own clinical yardstick the cardiologists were measured. Accuracy came in at 3.19, completeness at 3.04, no-revision-needed at 2.96, clinical alignment at 2.90 — competent, not glowing. Agreement within each group was only moderate: a Krippendorff α (a measure of inter-rater reliability, where 1 is perfect agreement) of 0.69 among clinicians and 0.75 among parents. So the disagreement is not merely clinicians-versus-parents; there is real spread inside each camp too.
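To make the α figures less abstract, here is a minimal sketch of Krippendorff's α for complete data. It assumes the interval distance metric (squared differences between ratings); the study does not say here which metric it used, and an ordinal metric would give somewhat different values. The function name and the toy ratings in the tests are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha with the interval metric (assumed, not from the paper).

    `units` holds one inner list per rated item (e.g. one summary),
    containing the ratings that item received.
    """
    # Only items with at least two ratings contribute pairable values.
    units = [u for u in units if len(u) >= 2]
    if not units:
        raise ValueError("need at least one item with two or more ratings")

    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences between ratings of the same item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pairable ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("no variation in ratings; alpha is undefined")

    return 1 - d_o / d_e
```

An α of 1 means raters never differ on the same item, 0 means they agree no more than chance would predict, and negative values indicate systematic disagreement.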

The finding is methodological, not a score

Read carelessly, the headline is a polite tie: parents liked it a bit more, clinicians a bit less. Read carefully, the contribution is elsewhere. Patient-facing AI is almost always evaluated the way clinical decision-support is — by clinical experts, against clinical criteria such as accuracy, completeness and fidelity to the source. For software that informs a clinical decision, that is exactly the right test. For software whose job is to help a worried relative follow what is happening, it measures the wrong thing well.

A rubric that only asks whether a summary is clinically accurate cannot tell you whether the mother at the bedside understood it.

Whether a text is clinically precise tells you nothing about whether the mother at the bedside understood it, whether it helped her shape the next conversation with the team, or whether it sharpened her worry instead of ordering it. Those are separate questions, and they need separate judges. That a clinical rubric fails to capture them is not a defect in the rubric; it is the rubric's boundary. What the study does well is make that boundary visible, and it returns to the readability number to do so: the warm parent ratings (readability scored 3.36) sit awkwardly next to a tenth-grade reading level, and probably reflect a health-literate volunteer group rather than parents in general.

What it does not support

The authors name their limits plainly, and they bind. This is a single centre, a highly specialised US institution and a narrow subspecialty, which constrains generalisability; the model overshot the readability it was told to hit; the raters only moderately agreed. The sharpest caveat is the easiest to overlook: parents rated summaries of other children's notes, not their own. That deliberately removes the emotional stake — a parent reading about their own critically ill child — that the tool would carry in real use. The reading captured here is calmer than the one that matters, which means even the parent scores may understate the stakes rather than overstate the benefit.

None of this dents the central claim, because the claim is narrow and well-chosen: judging a patient-communication tool by clinical fidelity alone risks the wrong verdict in either direction. A tool that scores middling with clinicians yet genuinely serves families better than what they had will be undervalued by a purely clinical assessment — and a tool that reads cleanly to experts but confuses families will be overvalued by it. This is a pilot raising a structural question in a thinly studied area, not a study that settles it.

The relevance to European hospitals investing in AI-assisted documentation is direct. The criteria that decide procurement and rollout are almost always clinical and administrative — efficiency, documentation quality, interoperability, data protection. Whether patients and relatives actually understand the generated text, find it useful, and can carry it into their own conversations with the team is rarely measured at all. The quiet argument of this paper is simply that someone should.

Source: Han B, Barnes T, Reddy CD, Shin AY. Evaluating Large Language Model–Generated Clinical Summaries Through a Dual-Perspective Framework: Retrospective Observational Study. JMIR AI 2026;5:e85221. A single-centre retrospective pilot with no funding and no declared conflicts of interest; it raises a methodological question rather than settling it.

