Journal Club · 5 min read

An AUROC of 0.805, Sitting on 97 Percent Heterogeneity

Twenty-eight machine-learning models claim to predict delirium after heart surgery. Pooled, they look clinically useful. Read the validation methods and the heterogeneity, and the single number stops meaning what it appears to.

Dr. Sven Jungmann

CEO

Twenty-one of these studies bothered to test their model on data the model had not seen during training. Exactly one of those twenty-one used data from a hospital that did not build it. That single number — one externally validated model out of twenty-eight — tells you most of what you need to know before you take the headline seriously. The headline is a pooled discrimination of 0.805 for predicting delirium after cardiac surgery, and on its own it looks like a tool worth having.

The condition is worth predicting. Roughly one in thirteen people who undergo heart surgery wakes into a delirium in the days that follow: in this pooled population, 6,326 of 80,143 patients, or 7.9 percent. It lengthens intensive-care stays, raises mortality, and shadows recovery for months. Flag the patients heading for it, and you can act early: lighter sedation, earlier mobilisation, targeted prophylaxis. So when Guo and colleagues set out to weigh the prediction models built for exactly this, the question they were really testing was not whether the idea is good. It was whether the tools are.

How the review was done, and what holds up

Rather than train one more model, the authors systematically reviewed 28 of them, drawn from studies covering 80,143 cardiac-surgery patients across twelve countries between 2012 and 2024. The pooled C-index came to 0.805 (95% CI 0.759–0.852) on the validation data, with a pooled sensitivity of 0.72 and specificity of 0.78. For a yes-or-no outcome the C-index is the area under the receiver-operating-characteristic curve (AUROC), which measures how cleanly a model separates the patients who will develop delirium from those who will not. Read literally, those figures describe a model catching roughly seven of every ten patients on their way to trouble, while wrongly flagging about one in five of those who are not.
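To make the pooled sensitivity and specificity concrete, the sketch below applies them to a hypothetical cohort of 1,000 cardiac-surgery patients at the pooled incidence. The cohort size and the function name `expected_counts` are illustrative assumptions, and treating the pooled figures as if they belonged to one model is exactly the simplification the rest of this piece questions.

```python
# Pooled figures quoted from the review; the 1,000-patient cohort is hypothetical.
EVENTS, PATIENTS = 6_326, 80_143
SENSITIVITY, SPECIFICITY = 0.72, 0.78


def expected_counts(cohort_size, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts for a cohort of the given size."""
    with_delirium = cohort_size * prevalence
    without_delirium = cohort_size - with_delirium
    true_pos = sensitivity * with_delirium
    false_neg = with_delirium - true_pos
    true_neg = specificity * without_delirium
    false_pos = without_delirium - true_neg
    return {"tp": true_pos, "fn": false_neg, "tn": true_neg, "fp": false_pos}


prevalence = EVENTS / PATIENTS
counts = expected_counts(1_000, prevalence, SENSITIVITY, SPECIFICITY)

# Positive predictive value: the share of flagged patients who develop delirium.
ppv = counts["tp"] / (counts["tp"] + counts["fp"])

print(f"incidence: {prevalence:.1%}")
print(f"caught: {counts['tp']:.0f}, missed: {counts['fn']:.0f}")
print(f"false alarms: {counts['fp']:.0f}")
print(f"share of flags that are real: {ppv:.0%}")
```

On these inputs the sketch gives roughly 57 patients caught, 22 missed, and about 200 false alarms, so only around one flag in five points at a patient who actually becomes delirious. A different baseline rate on a different ward would shift every one of those counts.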

Two things genuinely hold up. The recurring predictors are unglamorous and already in the chart — age, creatinine, time on cardiopulmonary bypass, the Mini-Mental State Examination score, left-ventricular ejection fraction. None demands a new test or a new sensor; they sit in almost every cardiac record, which means a model built on them could in principle run without disrupting anyone's day. And the review does not simply average its studies and declare a winner. It runs a formal risk-of-bias appraisal — and that appraisal is where the figure of 0.805 comes apart.

Why the single number does not travel

Start with the heterogeneity. The I-squared statistic — the share of variation between studies that reflects genuine differences rather than chance — reached 97.3 percent on the validation data, and 98.8 percent on the training data. The authors call it extreme, and the word is earned. At that level the 28 models are not one method measured under 28 conditions; they are 28 different things being added together. Pooling them yields a clean central estimate with a confidence interval, but that centre is not a property any single model would carry to any single bedside. It is the middle of a cloud, not the performance of a tool.
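For readers who want to see where an I-squared figure comes from, here is a minimal sketch of the standard calculation from Cochran's Q with inverse-variance weights. The five AUROC values and their standard errors are invented for illustration and are not taken from the review; the sketch also pools on the raw AUROC scale, which is an assumption, since the review's own transformation is not described here.

```python
def i_squared(estimates, std_errors):
    """Return (pooled estimate, Cochran's Q, I-squared as a fraction)."""
    weights = [1.0 / se**2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    # Q: weighted squared distance of each study from the pooled estimate.
    q = sum(w * (est - pooled) ** 2 for w, est in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of Q beyond what chance alone (df) would explain.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical study results, NOT data from Guo et al.
aurocs = [0.66, 0.74, 0.81, 0.88, 0.93]
std_errors = [0.010, 0.012, 0.008, 0.015, 0.009]

pooled, q, i2 = i_squared(aurocs, std_errors)
print(f"pooled AUROC: {pooled:.3f}, Q: {q:.1f}, I-squared: {i2:.1%}")
```

With precise studies that disagree this widely, the invented data land well above 90 percent, and the pooled value sits near 0.81 even though no single study reported anything close to it. That is the pattern the review's 97.3 percent describes.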

The bias appraisal arrives at the same place from the other side. Under PROBAST — the Prediction Model Risk of Bias Assessment Tool — 26 of the 28 models were rated high risk; only two were low. The faults are the familiar ones: too few delirium events per candidate variable, so the model fits noise; thin handling of missing data; weak guards against overfitting. And the validation picture is the part most worth dwelling on. Of the 21 studies that reported a validation set, 11 split their own data at random, 4 divided it by time, 3 used k-fold cross-validation, 2 resampled by bootstrap — and exactly one tested its model on an external cohort. Twenty-three of the 28 studies were single-centre. What the other twenty validations measure is how well a model memorised its own hospital, not whether it will work in yours.
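The validation tally above is simple enough to check by hand, but laying it out makes the imbalance hard to miss. The sketch below uses the counts reported in the review; the `events_per_variable` helper and the numbers passed to it are hypothetical, included only to show what "too few events per candidate variable" means arithmetically.

```python
# Validation strategies as reported in the review (21 validated studies).
VALIDATION_COUNTS = {
    "random split": 11,
    "temporal split": 4,
    "k-fold cross-validation": 3,
    "bootstrap": 2,
    "external cohort": 1,
}
TOTAL_MODELS = 28

validated = sum(VALIDATION_COUNTS.values())
external = VALIDATION_COUNTS["external cohort"]
internal_only = validated - external
no_validation = TOTAL_MODELS - validated


def events_per_variable(events, candidate_predictors):
    """Outcome events available per candidate predictor (hypothetical helper)."""
    if candidate_predictors <= 0:
        raise ValueError("candidate_predictors must be positive")
    return events / candidate_predictors


print(f"validated: {validated}, internal only: {internal_only}, "
      f"external: {external}, none reported: {no_validation}")
print(f"externally validated share: {external / TOTAL_MODELS:.1%}")

# Hypothetical study: 60 delirium cases, 30 candidate predictors.
print(f"events per variable: {events_per_variable(60, 30):.1f}")
```

A study with two events per candidate predictor, as in the invented example, sits far below the roughly ten that is often cited as a rule of thumb, which is the kind of shortfall that lets a model fit noise.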

A pooled C-index of 0.805 is not a stable estimate. It is the mathematical centre of results too different to belong in the same average.

What it means for the reader

This is the line between a benchmark and a clinical tool, and it is the line a reader who stops at the abstract will miss. A discrimination above 0.8 is a real and encouraging signal that the problem is learnable from routine data. It is not evidence that any one of these models discriminates at 0.8 on a ward it has never seen, where the baseline delirium rate, the surgical mix, and the way delirium is even assessed all differ from the training set — and assessment method does vary across these studies, which by itself shifts the apparent incidence. Guo and colleagues draw exactly the conclusion their data permit: the performance is promising, the evidence base too thin and too internally validated to act on. The honest next step is not deployment. It is external, multicentre, where possible prospective validation — before a single one of these scores is allowed to change what happens to a patient at four in the morning.

Source: Guo Y, Xu H, Wang A, Zhang M, Zhang S, Xie P. The Predictive Value of Machine Learning for Postoperative Delirium in Cardiac Surgery: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 2026;28:e72304. A systematic review and meta-analysis: its pooled estimate rests on 97 percent heterogeneity and on a set of 28 models of which 26 were judged at high risk of bias and only one was externally validated.
