Journal Club · 5 min read

When the Simpler Model Won: A Clinical BERT Beaten by Plain Word Vectors

A purpose-built clinical language model scored AUROC 0.59 at predicting heart-failure readmission. A far simpler embedding, trained on the dataset's own codes, scored 0.65. The more interesting number is that neither is good enough to act on.

Dr. Sven Jungmann

CEO

Sixty-five hundredths. That is the best score any of the methods in this study managed at the task it was built for: predicting which heart-failure patients would land back in hospital within 30 days. In the language of these models, that score is an area under the receiver operating characteristic curve (AUROC, a single number for how cleanly a model separates two outcomes; 0.5 is a coin-toss, 1.0 is perfect). A 0.65 means the model ranks a randomly chosen returning patient above a randomly chosen non-returning one about 65 times in 100, or roughly two times in three. Hold on to that figure: it matters more than the headline that came with it.
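
The ranking reading of AUROC can be checked by hand. The sketch below counts, over every pairing of one returning and one non-returning patient, how often the returning patient gets the higher score. The function name and the risk scores are invented for illustration; none of them come from the paper.

```python
from itertools import product


def pairwise_auroc(scores_pos, scores_neg):
    """Share of (positive, negative) pairs in which the positive case
    receives the higher score; a tied pair counts as half a win."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical risk scores, made up for this example.
readmitted = [0.30, 0.22, 0.15]
not_readmitted = [0.25, 0.18, 0.12, 0.10]

print(pairwise_auroc(readmitted, not_readmitted))  # 9 of 12 pairs -> 0.75
```

In this toy set the returning patient outranks the other in 9 of 12 pairs, an AUROC of 0.75. A score of 0.65 means about 65 such wins in every 100 pairs.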

The headline was an upset. BioClinicalBERT — a language model built and pretrained specifically for clinical text, the kind of tool you would reach for first — reached 0.59. A far plainer method, Word2Vec embeddings trained directly on the diagnostic codes in the study's own data, reached the 0.65. The model with the medical pedigree finished behind the home-grown one. That is the result people will repeat. The result worth keeping is what the winning number can and cannot do.

The setup

This is a clean, single-task comparison published in JMIR Medical Informatics in November 2025. The authors drew 21,031 heart-failure patients from MIMIC-IV, a widely used research database of de-identified records from the Beth Israel Deaconess Medical Center in Boston covering 2008 to 2019. Of those patients, 3,933, or 19 percent, were in fact readmitted within 30 days. The question was narrow and well-posed: which way of turning a patient's diagnostic history into numbers best predicts that readmission? Four representations went head to head, each handed to standard machine-learning classifiers: a plain one-hot baseline, BioClinicalBERT, and Word2Vec embeddings learned either from International Classification of Diseases (ICD) codes or from Unified Medical Language System concept identifiers. On the held-out test set the ranking was clear, with the two Word2Vec variants tied at the top: one-hot 0.54, BioClinicalBERT 0.59, Word2Vec on ICD codes 0.65, Word2Vec on concept identifiers 0.65.
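
To make "turning a diagnostic history into numbers" concrete, here is a minimal sketch of what a one-hot style baseline could look like, assuming it amounts to one binary indicator per code in the vocabulary. The helper names and the three sample admissions are illustrative assumptions, not the authors' pipeline.

```python
def build_vocab(admissions):
    """Map every distinct code seen in the data to a column index."""
    codes = sorted({code for adm in admissions for code in adm})
    return {code: i for i, code in enumerate(codes)}


def multi_hot(admission, vocab):
    """Binary vector with a 1 in the column of each code present."""
    vec = [0] * len(vocab)
    for code in admission:
        if code in vocab:  # codes unseen when building the vocab are skipped
            vec[vocab[code]] = 1
    return vec


# Illustrative admissions using ICD-10 style codes (assumed for this example).
admissions = [
    ["I50.9", "E11.9"],
    ["I50.9", "N18.3"],
    ["I50.9", "E11.9", "N18.3"],
]
vocab = build_vocab(admissions)
for adm in admissions:
    print(adm, multi_hot(adm, vocab))
```

Every code gets its own unrelated column here, so the vector cannot express that two diagnoses tend to travel together. That is the gap a learned embedding is meant to close.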

What 0.65 buys at the bedside

Not enough to act on. Discrimination this weak, paired with a 19 percent base rate, leaves the positive predictive value at any usable threshold modest — most patients the model flags will not return, and a fair number who do will be missed. The honest sentence is not "the simpler model is good enough." It is "none of these representations, on diagnostic codes alone, solves this problem." The comparison is genuine and the ranking is stable, but the winner sits well short of anything you would let near a discharge decision. A study that lands at 0.65 and says so plainly is doing the reader a service; the failure mode in this literature is a paper that reports 0.65 in the abstract and discusses it as though it were 0.85.
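
The arithmetic behind that claim is short. The sketch below applies Bayes' rule to the study's base rate (3,933 of 21,031) at one operating point. The sensitivity and specificity are assumptions picked to be roughly in keeping with an AUROC near 0.65; the paper does not report them here.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: share of flagged patients who truly return."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Base rate taken from the study's cohort counts.
prevalence = 3933 / 21031

# Hypothetical operating point, NOT a figure reported by the authors.
sensitivity, specificity = 0.60, 0.62

value = ppv(sensitivity, specificity, prevalence)
print(f"base rate {prevalence:.3f}, PPV {value:.3f}")
```

Under these assumed numbers, a little over a quarter of flagged patients would actually return, and four in ten of those who do return would not have been flagged at all.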

Why the plain method came first

The mechanism the authors propose is the part that travels. An embedding learned from the co-occurrence patterns inside the target dataset captures the signal specific to this task and this population. BioClinicalBERT carries a great deal of general clinical knowledge from pretraining on hospital notes — but here it was not reading notes. It was applied to long-form descriptions of structured codes and concatenated medication fields, not the free narrative text it was designed for. A model with broader knowledge is not automatically the better fit for a narrow, structured prediction; a representation tuned to the local data can simply carry more of the relevant variance. That is a defensible, mechanistic finding, and it generalises past this one task.
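
The idea of learning from local co-occurrence can be illustrated without Word2Vec itself. The sketch below is a deliberately simpler stand-in: it gives each code a vector of raw co-occurrence counts within admissions and compares codes by cosine similarity. It is not the authors' method, and the admissions are invented for the example.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(admissions):
    """One count vector per code: how often it shares an admission
    with each other code in the dataset."""
    vocab = sorted({code for adm in admissions for code in adm})
    index = {code: i for i, code in enumerate(vocab)}
    vectors = {code: [0] * len(vocab) for code in vocab}
    for adm in admissions:
        for a, b in combinations(sorted(set(adm)), 2):
            vectors[a][index[b]] += 1
            vectors[b][index[a]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Invented admissions: diabetes and kidney disease cluster with heart failure,
# COPD and nicotine dependence cluster with each other.
admissions = [
    ["I50.9", "N18.3", "E11.9"],
    ["I50.9", "N18.3"],
    ["I50.9", "E11.9", "N18.3"],
    ["J44.9", "F17.210"],
    ["J44.9", "F17.210", "I50.9"],
]
vecs = cooccurrence_vectors(admissions)
print(cosine(vecs["E11.9"], vecs["N18.3"]))  # codes that share a context
print(cosine(vecs["E11.9"], vecs["J44.9"]))  # codes that rarely do
```

Codes that keep the same company in this particular dataset end up with similar vectors, whatever a general-purpose model might have learned about them elsewhere. Word2Vec learns dense vectors by prediction rather than by counting, but it draws on the same local signal.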

The useful result is not that the cheaper model won. It is that none of them, on codes alone, is good enough to act on.

What the ceiling is made of

The authors are candid about the limits, and the limits explain a good deal of the 0.65. Each admission was modelled as a static bag of codes with no temporal ordering — the sequence and timing of diagnoses, which a clinician reads as a trajectory, were thrown away. The data come from a single institution, so portability to another hospital's coding habits is untested. And readmission is an intrinsically noisy target, driven by social circumstance and local discharge practice far more than by anything an ICD list records. A ceiling of 0.65 may be telling us as much about the limits of coded data for this question as about the four models competing on it.
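
What a static bag of codes discards is easy to show. In the sketch below, two invented diagnosis sequences in opposite order collapse to the same bag, so any model fed the bag cannot tell them apart. The codes and their ordering are assumptions made for illustration.

```python
from collections import Counter


def bag_of_codes(sequence):
    """Unordered multiset of codes: order and timing are dropped."""
    return Counter(sequence)


# Two hypothetical histories with the same diagnoses in opposite order.
hypertension_first = ["I10", "E11.9", "N18.3", "I50.9"]
heart_failure_first = ["I50.9", "N18.3", "E11.9", "I10"]

same_bag = bag_of_codes(hypertension_first) == bag_of_codes(heart_failure_first)
print(same_bag)  # True: the two trajectories are indistinguishable
```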

For the ward and the procurement file

Heart failure is among the most common admitting diagnoses in German and European hospitals, and 30-day readmission is precisely the target a vendor will offer to predict. Two habits from this paper are worth carrying into that conversation. First, before adopting a large, general-purpose clinical model for a specific task, check whether a simpler model trained on your own data does at least as well — the answer is empirical, not something the model's reputation settles. Second, treat a single discrimination number as the start of the question, not the end of it: it says nothing about whether the model holds up on a different population, where coding conventions and case mix shift the ground beneath it. Name the denominator, read the AUROC at the relevant base rate, and doubt the score until someone reproduces it elsewhere. That discipline is the part of this study worth taking back to the clinic.

Source: Shakya P, Khaneja A, Wagholikar KB. Predicting 30-Days Hospital Readmission for Patients With Heart Failure Using Electronic Health Record Embeddings: Comparative Evaluation. JMIR Medical Informatics 2025;13:e73020. A single-institution, retrospective comparison on the MIMIC-IV research database, supported by grants from the National Heart, Lung, and Blood Institute and Amazon Web Services, with no conflicts declared; it benchmarks representations, not a deployable clinical tool, and its best model reaches only modest discrimination.

