An AUROC of 0.805, Sitting on 97 Percent Heterogeneity
Twenty-eight machine-learning models claim to predict delirium after heart surgery. Pooled, they look clinically useful. Read the validation methods and the heterogeneity, and the single number stops meaning what it appears to.

Dr. Sven Jungmann
CEO

Twenty-one of these studies bothered to test their model on data the model had not seen during training. Exactly one of those twenty-one used data from a hospital that did not build it. That single number — one externally validated model out of twenty-eight — tells you most of what you need to know before you take the headline seriously. The headline is a pooled discrimination of 0.805 for predicting delirium after cardiac surgery, and on its own it looks like a tool worth having.
The condition is worth predicting. Roughly one in thirteen people who undergo heart surgery wakes into a delirium in the days that follow: in this pooled population, 6,326 of 80,143 patients, or 7.9 percent. It lengthens intensive-care stays, raises mortality, and shadows recovery for months. Flag the patients heading for it, and you can act early: lighter sedation, earlier mobilisation, targeted prophylaxis. So when Guo and colleagues set out to weigh the prediction models built for exactly this, the question they were really testing was not whether the idea is good but whether the tools are.
How the review was done, and what holds up
Rather than train one more model, the authors systematically reviewed 28 of them, drawn from studies covering 80,143 cardiac-surgery patients across twelve countries between 2012 and 2024. The pooled C-index came to 0.805 (95% CI 0.759–0.852) on the validation data, with a pooled sensitivity of 0.72 and specificity of 0.78. For a yes-or-no outcome such as delirium, the C-index is the area under the receiver-operating-characteristic curve (AUROC), which measures how cleanly a model separates the patients who will develop delirium from those who will not. Read literally, the sensitivity describes a model catching roughly seven of every ten patients on their way to trouble, and the specificity one that wrongly flags about one in five of those who are not.
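To see what those two pooled rates would imply in practice, the arithmetic can be sketched as below. This is an illustration only: it assumes the pooled sensitivity and specificity apply unchanged at the pooled incidence of 6,326 in 80,143, which nothing in the review guarantees for any single model. The function name `expected_counts` and the cohort size of 1,000 are hypothetical choices made for the example.

```python
def expected_counts(n, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts for a cohort of n patients.

    Illustrative arithmetic only: treats the pooled rates as if they
    described one model applied to one population.
    """
    cases = n * prevalence          # patients who develop delirium
    non_cases = n - cases           # patients who do not
    tp = sensitivity * cases        # cases correctly flagged
    fn = cases - tp                 # cases missed
    tn = specificity * non_cases    # non-cases correctly left alone
    fp = non_cases - tn             # non-cases wrongly flagged
    return {
        "tp": tp,
        "fn": fn,
        "fp": fp,
        "tn": tn,
        "ppv": tp / (tp + fp),      # share of alerts that are correct
        "npv": tn / (tn + fn),      # share of non-alerts that are correct
    }


# Figures as reported in the article.
prevalence = 6326 / 80143
result = expected_counts(1000, prevalence, sensitivity=0.72, specificity=0.78)

for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, about 57 of roughly 79 expected cases per 1,000 patients would be flagged, alongside about 200 patients flagged who never develop delirium, so a little over one in five alerts would be correct.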
Two things genuinely hold up. The recurring predictors are unglamorous and already in the chart — age, creatinine, time on cardiopulmonary bypass, the Mini-Mental State Examination score, left-ventricular ejection fraction. None demands a new test or a new sensor; they sit in almost every cardiac record, which means a model built on them could in principle run without disrupting anyone's day. And the review does not simply average its studies and declare a winner. It runs a formal risk-of-bias appraisal — and that appraisal is where the figure of 0.805 comes apart.
Why the single number does not travel
Start with the heterogeneity. The I-squared statistic — the share of variation between studies that reflects genuine differences rather than chance — reached 97.3 percent on the validation data, and 98.8 percent on the training data. The authors call it extreme, and the word is earned. At that level the 28 models are not one method measured under 28 conditions; they are 28 different things being added together. Pooling them yields a clean central estimate with a confidence interval, but that centre is not a property any single model would carry to any single bedside. It is the middle of a cloud, not the performance of a tool.
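For readers who want to see where an I-squared figure comes from, here is a minimal sketch using the standard definition based on Cochran's Q. The five AUROC values and their standard errors are invented for illustration and are not taken from the review. The function name `i_squared` is likewise hypothetical, and the review's own pooling model may differ from the simple inverse-variance weighting assumed here.

```python
def i_squared(estimates, standard_errors):
    """Return (pooled estimate, Cochran's Q, I-squared as a fraction).

    Assumes simple inverse-variance (fixed-effect) weights, which is
    enough to compute Q; not necessarily the review's pooling model.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Q: weighted squared distance of each study from the pooled value.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of Q in excess of what chance alone would produce.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical study-level AUROCs: each precise, but far apart.
aucs = [0.70, 0.75, 0.80, 0.85, 0.90]
ses = [0.01, 0.01, 0.01, 0.01, 0.01]

pooled, q, i2 = i_squared(aucs, ses)
print(f"pooled={pooled:.3f}  Q={q:.1f}  I2={100 * i2:.1f}%")
```

With these invented inputs I-squared comes out above 98 percent: when individual studies are precise but disagree with one another, almost all of the observed spread is attributed to real differences rather than sampling error.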
The bias appraisal arrives at the same place from the other side. Under PROBAST — the Prediction Model Risk of Bias Assessment Tool — 26 of the 28 models were rated high risk; only two were low. The faults are the familiar ones: too few delirium events per candidate variable, so the model fits noise; thin handling of missing data; weak guards against overfitting. And the validation picture is the part most worth dwelling on. Of the 21 studies that reported a validation set, 11 split their own data at random, 4 divided it by time, 3 used k-fold cross-validation, 2 resampled by bootstrap — and exactly one tested its model on an external cohort. Twenty-three of the 28 studies were single-centre. What the other twenty validations measure is how well a model memorised its own hospital, not whether it will work in yours.
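The validation counts above can be tallied in a few lines, as a check on the arithmetic. The numbers are those reported in this article; the variable names are hypothetical.

```python
# Validation approach by number of studies, as reported in the article.
validation_methods = {
    "random split of own data": 11,
    "temporal split of own data": 4,
    "k-fold cross-validation": 3,
    "bootstrap resampling": 2,
    "external cohort": 1,
}

total_models = 28
total_validated = sum(validation_methods.values())
external = validation_methods["external cohort"]
internal_only = total_validated - external
no_validation_reported = total_models - total_validated

# Share of the whole evidence base that was tested outside its own hospital.
external_share_of_all = external / total_models

print(f"validated: {total_validated} of {total_models}")
print(f"internal only: {internal_only}")
print(f"no validation set reported: {no_validation_reported}")
print(f"externally validated: {100 * external_share_of_all:.1f}% of all models")
```

The tally confirms that the five categories account for all 21 validations, and that the single external test represents under 4 percent of the 28 models.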
“A pooled C-index of 0.805 is not a stable estimate. It is the mathematical centre of results too different to belong in the same average.”
What it means for the reader
This is the line between a benchmark and a clinical tool, and it is the line a reader who stops at the abstract will miss. A discrimination above 0.8 is a real and encouraging signal that the problem is learnable from routine data. It is not evidence that any one of these models discriminates at 0.8 on a ward it has never seen, where the baseline delirium rate, the surgical mix, and even the way delirium is assessed all differ from the training set. Assessment method does vary across these studies, which by itself shifts the apparent incidence. Guo and colleagues draw exactly the conclusion their data permit: the performance is promising, but the evidence base is too thin and too reliant on internal validation to act on. The honest next step is not deployment. It is external, multicentre and, where possible, prospective validation, before a single one of these scores is allowed to change what happens to a patient at four in the morning.
Source: Guo Y, Xu H, Wang A, Zhang M, Zhang S, Xie P. The Predictive Value of Machine Learning for Postoperative Delirium in Cardiac Surgery: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 2026;28:e72304. A systematic review and meta-analysis: its pooled estimate rests on 97 percent heterogeneity and on a model set in which 26 of 28 models were judged at high risk of bias and only one was externally validated.


