Journal Club · 5 min read

When No Human Updates the Record: Machine Learning Meets the Fax Machine

A US health system taught software to read scanned colonoscopy reports and write follow-up dates into the record unsupervised. The build is clever and honest. But it is a single-site proof of concept, and only about a third of reports ever reached the automated step.

Dr. Sven Jungmann

CEO

Out of 690 colonoscopy reports the team checked by hand, the software got the follow-up date right and wrote it into the record in 188 of them: 27.2 percent, or a little over one in four. That is the number I would put on the cover of this paper from NYU Langone Health, not the 80.7 percent accuracy in the abstract. The gap between those two figures is the whole story, and it is a story worth telling, because the problem the authors set out to solve is as ordinary as healthcare gets.

Here is the gap they were chasing. An outside colonoscopy report arrives by fax, gets scanned, and settles into the record as an image. The patient had polyps; the gastroenterologist recommended a return in three years. But the reminder field still says ten years — the default — because no one ever opened the image to change it. Three years on, the system stays silent. Roughly four in five clinical documents live as unstructured free text like this, much of it scanned from elsewhere, and none of it reaches the rules engines that depend on structured fields. The authors built a pipeline to close that one gap: read the scanned report, extract the recommended surveillance interval, and write it into the health-maintenance field with no person in the loop.

The design is the part worth keeping

Two components do the work. A machine-learning model — trained on 7,021 documents and reading them with optical character recognition and natural-language processing — proposes a follow-up date, but acts only when its own confidence exceeds 70 percent. A software robot, the screen-driving kind of robotic process automation that elsewhere fills web forms, then opens the record and writes the field, but only if none of nine predefined business exceptions applies. The most important of those exceptions is deliberately conservative: the robot will not overwrite a date a clinician has already entered. Where a human has judged, the machine stands down. That restraint, more than any accuracy figure, is the transferable lesson.
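The two gates described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names (`Extraction`, `ReminderField`, `decide`) are hypothetical, and only the one business exception the article actually describes is modelled. The other eight are not listed in this piece, so they are left as a pluggable list rather than guessed at.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional, Sequence

# From the article: the model acts only when its confidence exceeds 70 percent.
CONFIDENCE_THRESHOLD = 0.70


@dataclass
class Extraction:
    """Hypothetical model output for one scanned report."""
    follow_up_date: Optional[date]  # None when no date was found
    confidence: float               # model confidence in [0, 1]


@dataclass
class ReminderField:
    """Hypothetical view of the health-maintenance field in the record."""
    follow_up_date: date
    entered_by_clinician: bool


# A business exception is any predicate that blocks the write when it is true.
BusinessException = Callable[[Extraction, ReminderField], bool]


def clinician_already_entered(extraction: Extraction, field: ReminderField) -> bool:
    """The one exception the article describes: never overwrite a clinician's date."""
    return field.entered_by_clinician


def decide(
    extraction: Extraction,
    field: ReminderField,
    exceptions: Sequence[BusinessException] = (clinician_already_entered,),
) -> str:
    """Return "write" only if every gate passes; otherwise say why the robot stands down."""
    if extraction.follow_up_date is None:
        return "skip: no date extracted"
    if extraction.confidence <= CONFIDENCE_THRESHOLD:
        return "skip: low confidence"
    for rule in exceptions:
        if rule(extraction, field):
            return f"skip: {rule.__name__}"
    return "write"


default_field = ReminderField(date(2033, 10, 1), entered_by_clinician=False)
judged_field = ReminderField(date(2026, 10, 1), entered_by_clinician=True)
three_years = Extraction(date(2026, 10, 1), confidence=0.92)

print(decide(three_years, default_field))  # write
print(decide(three_years, judged_field))   # skip: clinician_already_entered
```

The point of the shape is that every path except one ends in doing nothing. A missed update leaves the stale default in place for a person to catch; a wrong write would replace it with a confident error.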

It is also worth being plain about what kind of study this is. It is a proof-of-concept process study — six staged steps from gap analysis to implementation — not a trial. No control group, no patient followed forward, no endpoint beyond whether the right date landed in the right field. Publishing that is entirely legitimate. It simply is not a randomized result, and a careful reader holds it at that strength.

What the numbers genuinely show

On validation the pipeline reached 80.7 percent overall accuracy (557 of 690 documents, 95% CI 77.8–83.7) at deciding whether a valid follow-up date was present. In live operation across the health system it processed 16,563 external colonoscopy reports between October 2023 and December 2024, and among the documents that reached the automation step, 77.2 percent (4,512 of 5,841) produced a successful update. The authors estimate the process improved the accuracy of colorectal-screening reminder dates by nearly 30 percent. Against a baseline where the alternative is someone re-typing dates off scanned faxes, that is a real and useful gain — and the honest, narrow claim the data can bear.
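For readers who want to check the validation figure, the sketch below recomputes it from the raw counts. It uses a normal-approximation (Wald) interval, which happens to reproduce the reported bounds; whether the authors used that method is an assumption, since this piece does not say. The function name `wald_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes: int, total: int, level: float = 0.95):
    """Proportion with a normal-approximation confidence interval."""
    p = successes / total
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Validation counts quoted in the article: 557 correct out of 690 documents.
accuracy, low, high = wald_interval(557, 690)
print(f"{accuracy:.1%} ({low:.1%} to {high:.1%})")  # 80.7% (77.8% to 83.7%)
```

An interval about six points wide on 690 documents is a reminder that the headline accuracy is an estimate from one validation set, not a fixed property of the pipeline.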

Where the claims outrun the evidence

Two figures decide whether this is a solution or a head start. The first is the false-negative rate of 32.9 percent (130 of 395): when a follow-up date was actually present, the model missed it almost a third of the time. In screening, the miss is the dangerous error, the patient who is quietly never recalled, and the system's main safeguard is that it stays silent rather than writing something wrong. The second is the funnel. Only 35.3 percent of the 16,563 reports (5,841) ever became 'RPA-ready', that is, eligible for the robotic process automation step; the headline 77.2 percent applies to that smaller slice alone. End to end on validation, the pipeline correctly identified and updated 27.2 percent of cases, the 188-of-690 figure this piece opened with. Most of the work still needs a person.
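The funnel is easier to see with the arithmetic laid out. The sketch below recomputes each percentage from the counts quoted in this piece. Two of the values are derived here rather than reported: the sensitivity (the complement of the false-negative rate) and the share of all live reports that ended in a successful update. The helper `pct` and the variable names are illustrative.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Validation set: 395 documents actually contained a follow-up date; 130 were missed.
false_negative_rate = pct(130, 395)
sensitivity = pct(395 - 130, 395)          # derived, not stated in the article

# Live operation, October 2023 to December 2024.
rpa_ready_share = pct(5841, 16563)         # reports that reached the automation step
update_rate_in_slice = pct(4512, 5841)     # the headline success rate
updated_share_of_all = pct(4512, 16563)    # derived: successes over everything received

# End to end on validation: correct date identified and written.
end_to_end_validation = pct(188, 690)

print(false_negative_rate, sensitivity)              # 32.9 67.1
print(rpa_ready_share, update_rate_in_slice)         # 35.3 77.2
print(updated_share_of_all, end_to_end_validation)   # 27.2 27.2
```

The live and validation end-to-end figures land on the same rounded value, which suggests the one-in-four picture is not an artifact of the validation sample.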

And the whole thing rests on one institution's stack — Epic, UiPath, OnBase — at a single US health system, with no external validation. The authors are candid that the architecture is brittle: a routine update to any one component can silently break the robot, and the missing-data rate in received reports is high. None of this voids the work; it bounds it. What generalizes is the design pattern — confidence-gated extraction, conservative write rules, a machine that pauses when unsure — not the pipeline, which would have to be rebuilt and revalidated anywhere else.

Why it matters here

European hospitals carry the same backlog of scanned outside reports and the same reminder fields quietly holding stale defaults. The appeal of automating that gap is obvious, and so is the trap: software that writes to the patient record largely unsupervised needs a named owner, an audit trail, and a clear answer to who is accountable when a date is wrong. The most transferable finding is not a tool but a discipline — let the machine act only where the context is tightly defined, let it stop when confidence is low, and keep a person able to override it at any point. Measured against a miss rate of one in three, that restraint is not caution for its own sake. It is the design working as intended.

Source: Stevens ER, Hartman J, Testa P, et al. Leveraging Machine Learning and Robotic Process Automation to Identify and Convert Unstructured Colonoscopy Results Into Actionable Data: Proof-of-Concept Study. JMIR Medical Informatics 2025;13:e73504. A single-centre proof-of-concept development study with no external validation and no patient-outcome endpoint; its headline success rate applies only to the third of reports that reached the automated step.

This analysis comes from the people behind Visite.
