Journal Club · 21 May 2026 · 5 min read

Stanford Put a Language Model Inside the Chart. What the Report Can Prove

An academic centre embedded language models in its medical record and counted what happened: a thousand voluntary users, claimed millions in savings, and, to its credit, more than two unsupported statements per summary. A candid deployment report, not a controlled study.

Dr. Sven Jungmann

CEO

Over eighteen months, 1,075 clinicians at one American academic centre trained on a new tool and then kept using it — logging more than 23,000 sessions and over 19 billion processed tokens in the first three months alone. Nobody was made to. For most clinical software, that is the result that never arrives: people try it once, find it slower than what they had, and quietly stop. So a report in which a thousand physicians voluntarily fold a language model into daily care is worth opening. The question a journal club then asks is the harder one: what, exactly, does the report prove?

The tool is ChatEHR, described in an arXiv preprint from Stanford Health Care that has not yet been peer-reviewed. It wires large language models to the entire patient timeline — several years of records assembled into a structured data bundle — and surfaces them two ways: as an interactive chat panel inside the electronic record, and as automated background tasks that run a fixed prompt against fixed data. The premise is the one most clinicians will recognise: a great deal of the working day goes into synthesising long, scattered records, and a tool that does the assembling could hand some of that time back.

The shape of the evidence

Before any finding can mean anything, the design has to be named plainly. Stanford built the system, rolled it out across its own institution, and then wrote down what it saw. There is no control group, no randomisation, no matched site doing something different for comparison. This is an N-of-one experience report — careful, quantified, unusually candid, but a record of what one place did, not a test of whether it beat the alternative. Read at that strength, it is still valuable; read as proof that the tool works, it overreaches.

What it can show: use and error

Adoption is the firmest claim, because it is simply counted rather than modelled. Seven automations went into routine use; the chat interface accumulated 1,075 trained, recurring users. Sustained, unforced use is not nothing — most clinical AI dies on contact with a real workflow, and this did not.

The more telling result is the one that flatters the authors least. Auditing a ten-percent sample — 1,649 conversations, of which 719 were summary requests — they found a mean of 2.33 unsupported claims per summary: 0.73 outright hallucinations and 1.60 statements that contradicted the record, with roughly half the summaries containing one or none. A team that publishes that number about its own product, in a paper meant to showcase it, is reading its own system honestly. The conclusion the authors draw is the right one: the tool is useful enough that clinicians keep choosing it, and wrong often enough that someone has to check it. Both hold simultaneously. A deployment that denies the second half is the dangerous kind.
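The audit figures can be checked against one another, and they say something about the spread as well as the average. A minimal sketch using only the numbers quoted above; the function name is illustrative, and "roughly half" is treated as exactly 0.5, which is an assumption.

```python
# Figures as quoted in the article.
AUDITED_CONVERSATIONS = 1649
SUMMARY_REQUESTS = 719
MEAN_HALLUCINATIONS = 0.73
MEAN_CONTRADICTIONS = 1.60
MEAN_UNSUPPORTED = 2.33


def min_mean_of_remainder(overall_mean: float, share_low: float, low_cap: float) -> float:
    """Lower bound on the mean of the remaining summaries.

    If a fraction `share_low` of summaries has at most `low_cap` unsupported
    claims each, the rest must average at least this much for the overall
    mean to come out at `overall_mean`.
    """
    return (overall_mean - share_low * low_cap) / (1 - share_low)


if __name__ == "__main__":
    # The two error types should add up to the reported total.
    print(round(MEAN_HALLUCINATIONS + MEAN_CONTRADICTIONS, 2))
    # Share of audited conversations that were summary requests.
    print(round(SUMMARY_REQUESTS / AUDITED_CONVERSATIONS, 3))
    # Assumption: "roughly half" is taken as 0.5, "one or none" as a cap of 1.
    print(round(min_mean_of_remainder(MEAN_UNSUPPORTED, 0.5, 1), 2))
```

If half the summaries carry at most one unsupported claim and the overall mean is 2.33, the other half must average at least about 3.7. The errors are concentrated rather than evenly spread, which is an argument for checking every summary and not just the suspicious-looking ones.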

Where the headline outruns the data

The figure that travelled furthest is the shakiest. The $6 million in first-year savings is the authors' own estimate at current adoption, and they present it as such. Pull it apart and much of it is not keystrokes saved but revenue projected: a single automation that screens patients for transfer to a lower-acuity unit is credited with $2.4–3.3 million in annual revenue growth from about 1,700 transfers a year, against only modest labour savings of roughly $100,000. The chat panel's headline value, around $2.2 million a year against some $20,000 in model costs, is an extrapolation: one hundred daily users, about three queries each, ten assumed minutes saved per query, priced at a median salary. These are defensible internal projections. They are not measured savings, and they were assembled by the team with the strongest interest in the answer.
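The chat-panel extrapolation is simple enough to rerun. A minimal sketch using the figures quoted above; the number of days per year on which the tool is used is not stated here, so `days_per_year` is an assumption, and the function names are illustrative, not taken from the preprint. Rather than guess the salary, the sketch backs out the hourly rate that the $2.2 million figure would imply.

```python
# Figures as quoted in the article.
DAILY_USERS = 100
QUERIES_PER_USER = 3
MINUTES_SAVED_PER_QUERY = 10      # assumed by the authors, not measured
CLAIMED_ANNUAL_VALUE = 2_200_000  # dollars, "around $2.2 million"
TRANSFERS_PER_YEAR = 1700         # "about 1,700 transfers a year"


def annual_hours_saved(days_per_year: int) -> float:
    """Hours saved per year if every assumed minute is actually saved."""
    minutes_per_day = DAILY_USERS * QUERIES_PER_USER * MINUTES_SAVED_PER_QUERY
    return minutes_per_day / 60 * days_per_year


def implied_hourly_rate(days_per_year: int) -> float:
    """Hourly value of clinician time needed to reach the claimed total."""
    return CLAIMED_ANNUAL_VALUE / annual_hours_saved(days_per_year)


def revenue_per_transfer(annual_revenue: float) -> float:
    """Revenue attributed to each transfer at the quoted transfer volume."""
    return annual_revenue / TRANSFERS_PER_YEAR


if __name__ == "__main__":
    # Assumption: 250 working days or 365 calendar days of use per year.
    for days in (250, 365):
        print(days, annual_hours_saved(days), round(implied_hourly_rate(days), 2))
    for revenue in (2_400_000, 3_300_000):
        print(revenue, round(revenue_per_transfer(revenue), 2))
```

Under these assumptions the claimed value corresponds to roughly $176 per hour saved at 250 days a year, or about $121 at 365, and the transfer automation to about $1,400 to $1,900 per transfer. Every input is multiplied straight through, so halving the assumed ten minutes per query halves the total.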

And no figure in the paper touches the outcome that matters most. No one followed patients to learn whether faster summaries produced better decisions, earlier diagnoses, or fewer downstream errors. A quicker, fuller chart review is plausibly better care, but plausibly is not the same as demonstrably, and an uncontrolled single-site report cannot separate the two. The same caution applies to transferability: one institution's data plumbing, governance maturity and engineering bench cannot be assumed to exist at a hospital that lacks them.

The part worth borrowing

Taken for what it is, this is a useful document — one of the few candid accounts of language models meeting the full patient record in routine use, error rates and all. The portable lesson is not in the dollar estimates but in the method, and the authors are explicit about it: standard benchmark evaluation was insufficient. Passing a battery of medical exams told them little about how the tool behaved against three years of one real patient's notes, so they built continuous error measurement in live use instead. That is the transferable craft. For any European institution weighing such tools, the posture to copy is the one the authors model — count what your system gets wrong, in your own setting, and decide how you will measure it before you deploy, not after.
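What "decide how you will measure it before you deploy" might look like in practice can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the hash-based sampling, the function names and the review fields `hallucinations` and `contradictions` are all hypothetical, chosen only to mirror the ten-percent audit and the two error types described above.

```python
import hashlib
from statistics import mean


def in_audit_sample(conversation_id: str, rate: float = 0.10) -> bool:
    """Deterministically select about `rate` of conversations for review.

    Hashing the ID (an assumed design choice) keeps the selection stable
    across runs and independent of what the conversation contains.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def summarise_audit(reviews: list[dict]) -> dict:
    """Aggregate reviewer counts per summary into the headline metrics."""
    if not reviews:
        raise ValueError("no reviews to summarise")
    hallucinations = [r["hallucinations"] for r in reviews]
    contradictions = [r["contradictions"] for r in reviews]
    totals = [h + c for h, c in zip(hallucinations, contradictions)]
    return {
        "n": len(reviews),
        "mean_hallucinations": mean(hallucinations),
        "mean_contradictions": mean(contradictions),
        "mean_unsupported": mean(totals),
        "share_with_at_most_one": sum(t <= 1 for t in totals) / len(totals),
    }


if __name__ == "__main__":
    # Hypothetical reviewer labels for two summaries, for illustration only.
    demo = [
        {"hallucinations": 0, "contradictions": 1},
        {"hallucinations": 2, "contradictions": 3},
    ]
    print(summarise_audit(demo))
```

The point of fixing the sampling rule and the metric definitions in advance is that the numbers reported after launch cannot be quietly redefined to look better.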

Source: Shah NH, Pfeffer MA, et al. Adoption and Use of LLMs at an Academic Medical Center. arXiv preprint 2602.00074, submitted 21 January 2026. An uncontrolled single-centre deployment report by the system's developers, not peer-reviewed; its economic figures are the authors' own estimates, and it measures adoption and error rates, not patient outcomes.

Tags: Journal Club, Clinical AI, Electronic Health Records, Evidence-Based Medicine, Large Language Models
