Sixty-Five Studies Agree the Models Win. The Ward Hasn't Noticed.
A PRISMA review of 65 studies finds language models consistently beat classical methods at classifying clinical text. The honest reading is narrower: it is a synthesis of mostly single-site accuracy studies that rarely asked whether the models work at the bedside.

Dr. Sven Jungmann
CEO

Buried in the limitations of a new systematic review is the sentence that should set the agenda for everyone deciding whether to put a language model near a patient record. The authors note a "limited assessment of inference latency, deployment feasibility, and operational costs," because "most studies prioritize accuracy over practical implementation metrics." In plain terms: the field has measured, many times over, whether these models classify clinical text accurately. It has barely measured whether they can actually be run on a ward. The headline says yes; the fine print says we don't know.
The review in question, published in *JMIR AI*, is worth reading precisely because it is honest about that gap. Hajar Sakai and Sarah Lam at Binghamton University screened 826 papers under the PRISMA guidelines and kept 65: studies published between 2020 and the third quarter of 2024, with 28 of the 65 appearing in the first three quarters of 2024 alone. The field is moving fast. The question is whether it is moving toward the bedside or only around it.
The finding, stated fairly
Text classification is the quiet machinery of clinical informatics: assigning a diagnosis code, flagging an adverse drug reaction in a discharge letter, triaging a pathology report, sorting patient correspondence. High-volume, rule-bound work where a reliable classifier earns its keep without anyone noticing. On the narrow question of whether language models do this better than the classical machine-learning methods they were benchmarked against, the review's answer is a clear yes — and the strength of that answer lies in its consistency. Across binary, multiclass and multilabel tasks, across clinical notes, patient messages and the research literature, dozens of independent groups reached the same verdict. A single benchmark is easy to wave away; agreement at this scale is not. If you only wanted to know whether the models can classify clinical text well in a research setting, you can stop reading here.
The composition of the 65 studies is itself informative. Fine-tuning was the workhorse (35 studies), well ahead of prompt engineering (17). The locally runnable BERT family, the option that keeps patient data inside the institution, handled about half of the multilabel work, while closed-source GPT-family models led on binary (44.0%) and multiclass (30.6%) tasks. Clinical decision support was the most common single application, appearing in 29 of the 65 studies. The review is also candid about the trade-off every hospital faces: local models keep data in-house but demand labelled training sets and engineering effort, whereas closed, prompt-driven models deploy fast without labelled data but send text to an external server and, the authors note, carry operational costs that "can be substantial for high-volume applications." There is no costless choice, only a negotiation between data control and convenience.
Where the claim stops travelling
"More accurate than the baseline" is a statement about a held-out test set, usually from one institution, almost always in English. More than 80 percent of the included studies used English-language data, and the authors flag that this focus "hinder[s] the development of multilingual approaches." A German hospital adopting one of these methods is therefore largely off the map the literature has charted. The authors are equally direct that single-institution datasets "restrict the generalizability of results across different health care settings." Accuracy at one site does not transfer automatically to a hospital with a different documentation culture and a different hospital information system. And the cases where automated classification would help most, those involving rare conditions, are exactly where the authors warn that imbalanced data "can significantly skew" model performance.
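To make the imbalance warning concrete, here is a minimal sketch in Python. The counts are invented for illustration and are not taken from the review: 1,000 hypothetical reports, 10 of which describe a rare condition, scored against a degenerate classifier that always predicts the majority class.

```python
# Illustrative only: the data below is made up, not drawn from any reviewed study.

def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of truly positive cases that the classifier actually found."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0

# 1,000 hypothetical reports; label 1 marks the rare condition (10 cases).
y_true = [1] * 10 + [0] * 990
# A degenerate classifier that never flags the rare condition.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

A model that misses every rare case still scores 99 percent accuracy on this set, which is why a headline accuracy figure says little about the conditions where automation would matter most.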
Then there is the gap the studies left open by simply not looking. The review observes that long clinical documents force chunking strategies that "can lead to slower inference speeds and potentially delay real-time applications," and that "high operational costs of advanced LLMs … can pose barriers to practical deployment." These are not exotic concerns; they are the first questions an operations director would ask. A test-set accuracy figure answers none of them. The performance question has been settled many times; the deployment question has scarcely been put.
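Why chunking slows inference is easy to see in a small sketch. The Python below splits a long token sequence into overlapping windows, one common way to fit a document into a fixed model input size. The window size of 512, the overlap of 64, and the 2,000-token document are assumptions chosen for illustration, not figures from the review, and the list of placeholder tokens stands in for a real tokenizer's output.

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len and overlap are illustrative defaults, not values from the review.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if not tokens:
        return []
    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += step
    return chunks

# A hypothetical 2,000-token discharge letter (placeholder tokens).
document = ["tok"] * 2000
chunks = chunk_tokens(document)

# Each chunk needs its own forward pass, so sequential inference time
# grows roughly in proportion to the number of chunks.
print(len(chunks))  # 5
```

One short note costs one model call; this hypothetical letter costs five, before any step that merges the per-chunk predictions into a single label. That multiplier is the kind of figure an accuracy table does not report.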
“The technology is ready in the sense the studies tested. The conditions for using it safely — privacy under the GDPR, German-language validation, integration into hospital systems, interpretability at the point of decision — are not what these studies measured.”
What the authors would do next
The review's own recommendations map the road from a promising result to a usable one, and they are sober rather than grand. Parameter-efficient fine-tuning (PEFT, including LoRA) to make adaptation feasible without large compute budgets. Synthetic data to balance rare classes, with the explicit caveat that recursive training on synthetic text risks model collapse. Federated learning, so institutions can train a shared model without moving patient records. Temporal modelling, to capture the chronology that ordinary classification flattens. And, tellingly, a plea for studies to report "deployment specifications" such as "hardware, latency, and throughput," alongside explainable-AI methods so a clinician can see why a label was assigned.
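Reporting latency and throughput need not be elaborate. The following Python sketch shows one way a study could measure both, under stated assumptions: `classify_stub` is a hypothetical stand-in for a real model call, the sample documents are invented, and the choice of median and 95th-percentile latency is a common convention rather than something the review prescribes.

```python
import math
import statistics
import time

def classify_stub(text):
    """Hypothetical stand-in for a real model call; replace with actual inference."""
    return "adverse_event" if "rash" in text.lower() else "no_event"

def benchmark(classify, documents, warmup=3):
    """Time each call and summarise latency and throughput for one run."""
    # Warm-up calls are excluded so one-off start-up costs do not skew the numbers.
    for doc in documents[:warmup]:
        classify(doc)

    latencies = []
    run_start = time.perf_counter()
    for doc in documents:
        t0 = time.perf_counter()
        classify(doc)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - run_start

    latencies.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "n_documents": len(documents),
        "median_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[p95_index],
        "throughput_docs_per_s": len(documents) / total if total > 0 else float("inf"),
    }

# Invented sample documents, for illustration only.
sample_docs = ["Patient developed a rash after dose two.", "Routine follow-up, no complaints."] * 100
report = benchmark(classify_stub, sample_docs)
print(report)
```

The numbers only mean something next to the hardware they were measured on, which is why the authors ask for all three together.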
For a clinical or administrative leader in a German or European hospital, the practical reading is calm rather than urgent. The classifiers work in the lab. The surrounding conditions (data protection under the GDPR, validation on German-language records, integration into the hospital information system, and interpretability where a clinical decision turns on the output) are the parts these studies largely left untested. The first 65 studies established that the models are good at the easy half of the problem. The next 65 will have to be about the hard half.
Source: Sakai H, Lam SS. Large Language Models for Health Care Text Classification: Systematic Review. JMIR AI 2026;5:e79202 (published 11 February 2026; authors declared no conflicts of interest). A peer-reviewed PRISMA systematic review of 65 single-purpose studies, most from single institutions and in English; it establishes comparative accuracy in research settings, not real-world clinical performance, cost or generalisability.


