Journal Club · 5 min read

Stanford Put a Language Model Inside the Chart. What the Report Can Prove

An academic centre embedded language models in its medical record and counted what happened: a thousand voluntary users, millions in claimed savings and, to its credit, more than two unsupported statements per summary. A candid deployment report, not a controlled study.

Dr. Sven Jungmann

CEO

Over eighteen months, 1,075 clinicians at one American academic centre trained on a new tool and then kept using it — logging more than 23,000 sessions and over 19 billion processed tokens in the first three months alone. Nobody was made to. For most clinical software, that is the result that never arrives: people try it once, find it slower than what they had, and quietly stop. So a report in which a thousand physicians voluntarily fold a language model into daily care is worth opening. The question a journal club then asks is the harder one: what, exactly, does the report prove?

The tool is ChatEHR, described in an arXiv preprint from Stanford Health Care that has not yet been peer-reviewed. It wires large language models to the entire patient timeline — several years of records assembled into a structured data bundle — and surfaces them two ways: as an interactive chat panel inside the electronic record, and as automated background tasks that run a fixed prompt against fixed data. The premise is the one most clinicians will recognise: a great deal of the working day goes into synthesising long, scattered records, and a tool that does the assembling could hand some of that time back.

The shape of the evidence

Before any finding can mean anything, the design has to be named plainly. Stanford built the system, rolled it out across its own institution, and then wrote down what it saw. There is no control group, no randomisation, no matched site doing something different for comparison. This is an N-of-one experience report — careful, quantified, unusually candid, but a record of what one place did, not a test of whether it beat the alternative. Read at that strength, it is still valuable; read as proof that the tool works, it overreaches.

What it can show: use and error

Adoption is the firmest claim, because it is simply counted rather than modelled. Seven automations went into routine use; the chat interface accumulated 1,075 trained, recurring users. Sustained, unforced use is not nothing — most clinical AI dies on contact with a real workflow, and this did not.

The more telling result is the one that flatters the authors least. Auditing a ten-percent sample — 1,649 conversations, of which 719 were summary requests — they found a mean of 2.33 unsupported claims per summary: 0.73 outright hallucinations and 1.60 statements that contradicted the record, with roughly half the summaries containing one or none. A team that publishes that number about its own product, in a paper meant to showcase it, is reading its own system honestly. The conclusion the authors draw is the right one: the tool is useful enough that clinicians keep choosing it, and wrong often enough that someone has to check it. Both hold simultaneously. A deployment that denies the second half is the dangerous kind.
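The audit figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above. The lower bound on the error-heavy half is an inference, not a figure from the report: it assumes that exactly half of the summaries carry at most one unsupported claim, where the report says "roughly half". Variable names are illustrative.

```python
# Figures quoted from the report's audit of sampled conversations.
sampled_conversations = 1_649
summary_requests = 719
hallucinations_per_summary = 0.73
contradictions_per_summary = 1.60

# The two error types should add up to the reported mean of 2.33.
mean_unsupported = hallucinations_per_summary + contradictions_per_summary

# Share of sampled conversations that were summary requests.
summary_share = summary_requests / sampled_conversations

# ASSUMPTION (not from the report): exactly half of the summaries
# contain at most one unsupported claim.
low_error_fraction = 0.5
max_claims_in_low_half = 1

# If the low-error half contributes at most 0.5 * 1 to the overall mean,
# the remaining summaries must make up the rest.
low_half_max_contribution = low_error_fraction * max_claims_in_low_half
min_mean_in_high_half = (
    (mean_unsupported - low_half_max_contribution) / (1 - low_error_fraction)
)

print(f"mean unsupported claims per summary: {mean_unsupported:.2f}")
print(f"summary requests as share of sample: {summary_share:.1%}")
print(f"minimum mean in the error-heavy half: {min_mean_in_high_half:.2f}")
```

Under that assumption, the errors are not spread evenly: the other half of the summaries would have to average at least about 3.7 unsupported claims each, which is a reason to look at the distribution and not only the mean.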

Where the headline outruns the data

The figure that travelled furthest is the shakiest. The $6 million in first-year savings is the authors' own estimate at current adoption, and they present it as such. Pull it apart and most of it is not keystrokes saved but revenue projected: a single automation that screens patients for transfer to a lower-acuity unit is credited with $2.4–3.3 million in annual revenue growth from about 1,700 transfers a year, against only modest labour savings of roughly $100,000. The chat panel's headline value — around $2.2 million a year against some $20,000 in model costs — is an extrapolation: one hundred daily users, about three queries each, ten assumed minutes saved per query, priced at a median salary. These are defensible internal projections. They are not measured savings, and they were assembled by the team with the strongest interest in the answer.
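The projections can be unpacked the same way. The sketch below takes its inputs from the figures quoted above. The number of days per year on which the tool is used is not stated here, so two values are tried as assumptions, and the implied hourly rate is back-calculated from the quoted total; it is not a figure from the report.

```python
# Inputs as quoted in the article; dollar amounts are approximate.
daily_users = 100
queries_per_user = 3
minutes_saved_per_query = 10        # assumed by the authors, not measured
projected_annual_value = 2_200_000  # USD
annual_model_cost = 20_000          # USD

# Clinician hours the projection assumes are saved on each day of use.
hours_saved_per_day = daily_users * queries_per_user * minutes_saved_per_query / 60


def implied_hourly_rate(days_per_year: int) -> float:
    """Hourly salary value needed for the projection to reach the quoted total."""
    return projected_annual_value / (hours_saved_per_day * days_per_year)


# ASSUMPTION (not from the report): working days only, or every calendar day.
for days in (250, 365):
    rate = implied_hourly_rate(days)
    print(f"{days} days/year -> implied rate of ${rate:,.0f} per hour")

# Projected value relative to model spend.
value_to_cost_ratio = projected_annual_value / annual_model_cost
print(f"projected value is {value_to_cost_ratio:.0f}x the model cost")

# Transfer-screening automation: projected revenue per transfer.
transfers_per_year = 1_700
revenue_low, revenue_high = 2_400_000, 3_300_000
per_transfer_low = revenue_low / transfers_per_year
per_transfer_high = revenue_high / transfers_per_year
print(f"revenue per transfer: ${per_transfer_low:,.0f} to ${per_transfer_high:,.0f}")
```

Every factor in the chat-panel product is an assumption or an approximation, and the result scales linearly with each one: halve the minutes saved per query and the projected value halves with it.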

And no figure in the paper touches the outcome that matters most. No one followed patients to learn whether faster summaries produced better decisions, earlier diagnoses, or fewer downstream errors. A quicker, fuller chart review is plausibly better care, but plausibly is not the same as demonstrably, and an uncontrolled single-site report cannot separate the two. The same caution applies to transferability across institutions: one centre's data plumbing, governance maturity and engineering bench cannot be assumed to exist at a hospital that lacks them.

The part worth borrowing

Taken for what it is, this is a useful document — one of the few candid accounts of language models meeting the full patient record in routine use, error rates and all. The portable lesson is not in the dollar estimates but in the method, and the authors are explicit about it: standard benchmark evaluation was insufficient. Passing a battery of medical exams told them little about how the tool behaved against three years of one real patient's notes, so they built continuous error measurement in live use instead. That is the transferable craft. For any European institution weighing such tools, the posture to copy is the one the authors model — count what your system gets wrong, in your own setting, and decide how you will measure it before you deploy, not after.
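What deciding the measurement before deployment might look like can be sketched in a few lines. This is a hypothetical illustration, not the Stanford team's pipeline: the function names, the ID format and the hash-based sampling rule are all assumptions, and only the ten-percent sampling rate echoes the audit described earlier.

```python
import hashlib


def in_audit_sample(conversation_id: str, sample_percent: int = 10) -> bool:
    """Deterministically assign a conversation to the audit sample.

    Hashing the ID fixes the sampling rule before any output is seen,
    so reviewers cannot pick which conversations get checked.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_percent


def mean_unsupported_claims(reviewed: dict[str, int]) -> float:
    """Mean reviewer-counted unsupported claims per audited summary."""
    if not reviewed:
        raise ValueError("no audited summaries yet")
    return sum(reviewed.values()) / len(reviewed)


# Hypothetical conversation IDs, for illustration only.
conversation_ids = [f"conv-{i}" for i in range(10_000)]
audit_queue = [cid for cid in conversation_ids if in_audit_sample(cid)]
print(f"{len(audit_queue)} of {len(conversation_ids)} conversations queued for review")
```

The counting itself still has to be done by clinicians reading each sampled summary against the record; the code only fixes in advance which conversations they read and how the result is summarised.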

Source: Shah NH, Pfeffer MA, et al. Adoption and Use of LLMs at an Academic Medical Center. arXiv preprint 2602.00074, submitted 21 January 2026. An uncontrolled single-centre deployment report by the system's developers, not peer-reviewed; its economic figures are the authors' own estimates, and it measures adoption and error rates, not patient outcomes.

This analysis comes from the people behind Visite.