Journal Club · 4 min read

1,800 Reports in 4.45 Hours: An Engineering Report Wearing a Study's Title

A urology group structured 1,800 free-text MRI reports through a GPT-4 pipeline at under a cent each. The throughput is real and the plumbing is the point — but the word 'validation' in the title describes earlier papers, not this one.

Dr. Sven Jungmann

CEO

Nine-tenths of a cent. That was the cost, on the authors' own ledger, of turning a single free-text prostate-MRI report into sixteen structured fields. Run the arithmetic across the whole batch and a urology group at the University of California, San Francisco converted 1,800 radiology narratives into research-grade rows for roughly the price of a sandwich. The job took 4.45 hours — about 8.90 seconds per report — on an institutional, privacy-secured GPT-4, and every single document finished: a 100 percent completion rate. Out came prostate volume, PSA density, PI-RADS score, clinical staging, anatomical findings, sixteen fields in all.
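The arithmetic behind those figures is short enough to check by hand. A minimal sketch in Python, assuming "nine-tenths of a cent" means exactly $0.009 per report (the authors' ledger may carry more decimals); the variable names are illustrative, not the paper's.

```python
from decimal import Decimal

REPORTS = 1800
TOTAL_HOURS = Decimal("4.45")
# Assumption: "nine-tenths of a cent" is read as exactly $0.009 per report.
COST_PER_REPORT_USD = Decimal("0.009")

# Wall-clock time spread evenly over the batch.
seconds_per_report = TOTAL_HOURS * 3600 / REPORTS

# Per-report cost scaled to the full batch.
total_cost_usd = COST_PER_REPORT_USD * REPORTS

print(f"{seconds_per_report:.2f} s per report")   # 8.90 s per report
print(f"${total_cost_usd:.2f} for the batch")      # $16.20 for the batch
```

Decimal is used here only so the cents do not pick up floating-point noise; plain floats would give the same answer to two places.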

It is a genuinely useful result. It is also worth being exact about which result it is, because the paper's own title promises something the study deliberately did not do. The authors call this a technical implementation study, and that is the honest label to read it under.

The plumbing, documented

The pipeline, named UODBLLM, is a modular Python application bolted onto an existing urological research database. Its loop is unremarkable on purpose: pull an unstructured note, wrap it in a versioned XML prompt template, send it to the model, take back structured output, write the result home. The prompt templates live in database tables rather than in source code, so a coordinator can revise them without redeploying anything, and the GPT-4 backend is treated as swappable — the design is model-agnostic by intent. It ran on a 2019 laptop. The contribution is precisely the lack of glamour: the paper writes down the connective tissue most published demonstrations omit, including the database binding, the error handling, the quality-assurance step, and a per-report cost line.
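The loop described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the authors' code: the table names, the schema, the field names and the `fake_model` stub are all hypothetical, and SQLite stands in for whatever database the group actually runs. What it tries to show is the two design choices the paper stresses: the prompt template is read from a table at run time, and the model is just a function passed in.

```python
import json
import sqlite3
from typing import Callable
from xml.sax.saxutils import escape


def build_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the research database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE prompt_templates (name TEXT, version INTEGER, body TEXT);
        CREATE TABLE reports (id INTEGER PRIMARY KEY, note TEXT);
        CREATE TABLE extractions (report_id INTEGER, template_version INTEGER, fields TEXT);
    """)
    db.execute(
        "INSERT INTO prompt_templates VALUES (?, ?, ?)",
        ("prostate_mri", 1,
         "<prompt><task>Extract fields as JSON.</task><report>{note}</report></prompt>"),
    )
    db.execute("INSERT INTO reports (note) VALUES (?)",
               ("Prostate volume 42 cc. PI-RADS 4 lesion.",))
    return db


def latest_template(db: sqlite3.Connection, name: str) -> tuple[int, str]:
    """Fetch the newest template version, so edits need no redeploy."""
    row = db.execute(
        "SELECT version, body FROM prompt_templates "
        "WHERE name = ? ORDER BY version DESC LIMIT 1",
        (name,),
    ).fetchone()
    return row[0], row[1]


def run_pipeline(db: sqlite3.Connection, call_model: Callable[[str], str]) -> int:
    """Pull notes, wrap them in the template, call the model, write results back."""
    version, body = latest_template(db, "prostate_mri")
    written = 0
    for report_id, note in db.execute("SELECT id, note FROM reports").fetchall():
        prompt = body.replace("{note}", escape(note))  # keep the XML well-formed
        raw = call_model(prompt)
        try:
            fields = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output is skipped and left for human review
        db.execute("INSERT INTO extractions VALUES (?, ?, ?)",
                   (report_id, version, json.dumps(fields)))
        written += 1
    db.commit()
    return written


def fake_model(prompt: str) -> str:
    """Stub backend returning fixed values; a real model call would go here."""
    return json.dumps({"prostate_volume_cc": 42, "pi_rads": 4})


db = build_demo_db()
print(run_pipeline(db, fake_model))  # 1
```

Swapping the backend means passing a different function to `run_pipeline`; revising the prompt means inserting a new row with a higher version number. Storing the template version alongside each extraction is an addition of this sketch, made so that a later audit can tell which prompt produced which row.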

Two numbers that look alike and are not

The most load-bearing word in the title is validation, and it points elsewhere. This study did not re-measure extraction accuracy, and the authors state so plainly. The accuracy figures attached to the method come from two of their own earlier papers, on different report sets: a median field-level accuracy of 98.1 percent (interquartile range 96.3 to 99.2 percent) across 424 reports, and above 95 percent on a further 228. None of that was reproduced here. So the 100 percent that this paper does report is a completion rate, not a correctness rate: every report produced an answer; whether each answer was right is a separate question the study did not ask. A busy reader can collapse those two quantities at a glance, and the title invites the slip.
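The difference between the two quantities is easy to see once both are written down. A minimal sketch with invented toy data (two reports, two fields each, none of it from the paper); the function and field names are hypothetical.

```python
from typing import Optional


def completion_rate(outputs: list[Optional[dict]]) -> float:
    """Share of reports for which the pipeline returned any structured output."""
    return sum(out is not None for out in outputs) / len(outputs)


def field_accuracy(outputs: list[Optional[dict]], reference: list[dict]) -> float:
    """Share of reference fields whose extracted value matches a human-checked value."""
    total = correct = 0
    for out, ref in zip(outputs, reference):
        for field, expected in ref.items():
            total += 1
            if out is not None and out.get(field) == expected:
                correct += 1
    return correct / total


# Toy data, invented for illustration only.
outputs = [
    {"pi_rads": 4, "volume_cc": 42},
    {"pi_rads": 3, "volume_cc": 55},  # an answer was produced, but the volume is wrong
]
reference = [
    {"pi_rads": 4, "volume_cc": 42},
    {"pi_rads": 3, "volume_cc": 35},
]

print(completion_rate(outputs))            # 1.0
print(field_accuracy(outputs, reference))  # 0.75
```

Completion needs only the pipeline's output. Accuracy needs a second, human-checked column, which is exactly the piece this study did not collect.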

The efficiency headline deserves the same care. The figure of up to a 90 percent reduction in manual extraction time is, in the authors' wording, an estimate of what the approach could save when many variables are pulled from many reports — not a head-to-head measurement against human abstractors doing the same task. It is an engineering projection, offered as one.

What the result actually carries

Held to throughput and operability — the things it did measure — the paper earns its conclusion cleanly. Across eighteen batches of 100, the system processed all 1,800 reports without a failed run, at a stable measured speed and a token cost low enough to make the dollar total a rounding error. As a demonstration that a free-text-to-structured-data pipeline can be embedded in a live clinical research database and run end to end at this scale, it stands up.
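Eighteen batches of 100 is a simple chunking pattern. A hedged sketch of how such a run might be organised, with failed batches recorded rather than allowed to halt the job; the paper does not describe its batching code, so the helper names and the no-op `process_batch` stand-in are assumptions.

```python
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    report_ids: Sequence[int],
    size: int,
    process_batch: Callable[[Sequence[int]], None],
) -> list[int]:
    """Run every batch and return the indices of the batches that raised."""
    failed = []
    for index, batch in enumerate(batches(report_ids, size)):
        try:
            process_batch(batch)
        except Exception:
            failed.append(index)  # record and continue; one bad batch should not stop the run
    return failed


report_ids = list(range(1, 1801))  # 1,800 placeholder report IDs
batch_count = len(list(batches(report_ids, 100)))
failed = run_in_batches(report_ids, 100, lambda batch: None)  # no-op stand-in

print(batch_count, failed)  # 18 []
```

An empty `failed` list corresponds to "without a failed run". It says nothing about whether the extracted values were correct.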

The boundaries are the ones the authors flag themselves: one centre, one report type, one institution-specific model, with the explicit caveat that performance may shift on other models and that the accuracy of language-model extraction still needs human checking for the data points that count. The architecture is built to generalise; generalisation is asserted here, not shown.

A 100 percent completion rate tells you every report produced an answer. It does not tell you the answer was right — and this study did not set out to ask.

Where the cost moved

For years the binding constraint on turning clinical free text into research-grade data was the technology itself. This paper is a quiet marker that the constraint has moved. The expensive part is no longer the reading; it is the integration — the database wiring, the versioned prompts, the error handling, and above all the human quality-assurance loop that the authors keep but do not re-quantify. That reframing holds regardless of jurisdiction, and it changes the question for anyone running a tumour registry or an outcomes database in free text: not whether a model can parse a report, but what it costs to embed one safely and how its accuracy gets checked once it is live. The plumbing has become the point — and confirming that the answers are right remains someone else's paper to write.

Source: Carlisle MN, Pace WA, Liu AW, Krumm R, Cowan JE, Carroll PR, Cooperberg MR, Odisho AY. Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction From Electronic Health Records. JMIR Bioinformatics and Biotechnology 2026;7:e70708. A single-centre technical implementation study; extraction accuracy was not re-measured here but cited from the authors' prior work, and no conflicts of interest were declared.

Tags: Journal Club, Clinical AI, Data Engineering, Large Language Models, Evidence-Based Medicine
