Two Readers, One Summary: Who Should Grade Patient-Facing AI?
A small Stanford study had clinicians and parents rate the same AI-written clinical summaries. They disagreed, significantly — and that disagreement, not the scores, is the finding worth keeping.

Dr. Sven Jungmann
CEO

The summaries were written at a Flesch–Kincaid grade level of 10.6 — roughly tenth-grade reading difficulty. The prompt had asked for sixth-to-eighth grade. That single gap, between what a language model was told to produce for families and what it actually produced, is a small window onto a larger question this study is unusually honest about: who is the right judge of a clinical text meant for patients?
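For readers who want to see where a number like 10.6 comes from, the Flesch–Kincaid grade level is a fixed formula over average sentence length and average syllables per word. The sketch below is illustrative only: the function names are hypothetical, the syllable counter is a rough vowel-group heuristic, and it is not the tool the study's authors used, so it will not reproduce their figure exactly.

```python
import re


def fk_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level formula applied to raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(text: str) -> float:
    """Estimate the grade level of a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade_from_counts(len(words), len(sentences), syllables)


if __name__ == "__main__":
    plain = "The cat sat. The dog ran."
    dense = ("Postoperative haemodynamic monitoring continues because "
             "ventricular function remains moderately depressed.")
    print(round(fk_grade(plain), 2), round(fk_grade(dense), 2))
```

Long sentences and polysyllabic clinical vocabulary both push the score up, which is why a summary can stay accurate and still overshoot a sixth-to-eighth-grade target.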
The study, published in JMIR AI on 10 February 2026 by a group from the Lucile Packard Children's Hospital at Stanford, is small and clear about its scope. The team took the assessment-and-plan sections of two consecutive daily progress notes for each of 50 children on a paediatric cardiovascular intensive care unit and summarised them with GPT-4o mini. Eight paediatric cardiologists and ten parents of children with heart disease then rated those summaries — two clinicians and two parents per summary, on different rubrics, for different things. The interesting move is not the summarisation. It is the decision to ask two kinds of reader the same underlying question and watch them part ways.
Where the readers diverge
Asked how helpful the summaries would be for families, parents answered warmly: three helpfulness items scored between 3.25 and 3.36 on a four-point scale. Clinicians, judging that same dimension (would this help a family?), landed at 2.97. A Mann–Whitney U test indicates the gap is unlikely to be chance (U = 3897; z = 2.69; P = .007). It is a modest distance in absolute terms and a clear one statistically: the people the tool is for and the people who would sign off on it do not rate its usefulness the same way.
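The reported z and P can be checked against each other with a few lines of arithmetic. This is a minimal sketch assuming the P value is two-sided and comes from the usual normal approximation to the Mann–Whitney U statistic; the function name is hypothetical, and the article does not say whether the authors applied tie or continuity corrections.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability of a standard normal for a given z score."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # z value as reported in the study summary above
    print(round(two_sided_p_from_z(2.69), 3))
```

Under those assumptions, z = 2.69 corresponds to a P value of about .007, consistent with the figure reported.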
On their own clinical yardstick the cardiologists were measured. Accuracy came in at 3.19, completeness at 3.04, no-revision-needed at 2.96, clinical alignment at 2.90 — competent, not glowing. Agreement within each group was only moderate: a Krippendorff α (a measure of inter-rater reliability, where 1 is perfect agreement) of 0.69 among clinicians and 0.75 among parents. So the disagreement is not merely clinicians-versus-parents; there is real spread inside each camp too.
The finding is methodological, not a score
Read carelessly, the headline is a polite tie: parents liked it a bit more, clinicians a bit less. Read carefully, the contribution is elsewhere. Patient-facing AI is almost always evaluated the way clinical decision-support is — by clinical experts, against clinical criteria such as accuracy, completeness and fidelity to the source. For software that informs a clinical decision, that is exactly the right test. For software whose job is to help a worried relative follow what is happening, it measures the wrong thing well.
“A rubric that only asks whether a summary is clinically accurate cannot tell you whether the mother at the bedside understood it.”
Whether a text is clinically precise tells you nothing about whether the mother at the bedside understood it, whether it helped her shape the next conversation with the team, or whether it sharpened her worry instead of ordering it. Those are separate questions, and they need separate judges. That a clinical rubric does not capture them is not a defect in the rubric; it is its boundary. What the study does well is make that boundary visible, and the readability number shows how: the warm parent ratings (readability scored 3.36) sit awkwardly next to a tenth-grade reading level, and probably reflect a health-literate volunteer group rather than parents in general.
What it does not support
The authors name their limits plainly, and they bind. This is a single centre, a highly specialised US institution and a narrow subspecialty, which constrains generalisability; the model overshot the readability it was told to hit; the raters only moderately agreed. The sharpest caveat is the easiest to overlook: parents rated summaries of other children's notes, not their own. That deliberately removes the emotional stake the tool would carry in real use: a parent reading about their own critically ill child. The reading captured here is calmer than the one that matters, so the parent scores may not carry over to a parent reading under that strain.
None of this dents the central claim, because the claim is narrow and well-chosen: judging a patient-communication tool by clinical fidelity alone risks the wrong verdict in either direction. A tool that scores middling with clinicians yet genuinely serves families better than what they had will be undervalued by a purely clinical assessment — and a tool that reads cleanly to experts but confuses families will be overvalued by it. This is a pilot raising a structural question in a thinly studied area, not a study that settles it.
The relevance to European hospitals investing in AI-assisted documentation is direct. The criteria that decide procurement and rollout are almost always clinical and administrative — efficiency, documentation quality, interoperability, data protection. Whether patients and relatives actually understand the generated text, find it useful, and can carry it into their own conversations with the team is rarely measured at all. The quiet argument of this paper is simply that someone should.
Source: Han B, Barnes T, Reddy CD, Shin AY. Evaluating Large Language Model–Generated Clinical Summaries Through a Dual-Perspective Framework: Retrospective Observational Study. JMIR AI 2026;5:e85221. A single-centre retrospective pilot with no funding and no declared conflicts of interest; it raises a methodological question rather than settling it.


