Surgical AI That Works in the Paper but Not in the Room
A scoping review screened 275 records to find every AI model meant to prevent surgical complications and follow it to the bedside. Across the 19 studies it included, the models were often accurate. Only two tools are in routine use, and the bottleneck is not the algorithm.

Dr. Sven Jungmann
CEO

Count the tools that have actually made it into a real operating room, and the number is two. A surgical risk calculator that has been around for a decade, and one commercial monitor that watches blood pressure. That is the working population of deployed artificial intelligence in this corner of surgery — not a list of nineteen, which is how many studies a new review managed to find, but two that a surgeon somewhere can use today without it being an experiment.
The review is by Mevik and colleagues in JMIR AI, and its question is deliberately narrow. Not whether AI can forecast a surgical complication — that case is largely made — but whether any of these forecasts has travelled the distance from a validation table to a patient on a list. They searched eleven databases, screened 275 records, and were left with 19 empirical studies of models used or trialled in real-world surgical settings, published between 2013 and the start of 2025. This is a scoping review: it maps a field rather than pooling an effect. For a question about the state of practice, that is the correct instrument, and it should be read as one.
An accurate field
On the narrow measure of technical accuracy, the literature looks healthy. The dominant subject is intra-operative hypotension, low blood pressure during surgery that is associated with acute kidney injury, myocardial damage and longer intensive-care stays, and it dominates clearly: eleven of the nineteen studies, more than half, examined the Hypotension Prediction Index. Models forecasting hypotension reached an area under the receiver operating characteristic curve (AUROC, a measure of how well a model separates events from non-events; 1.0 is perfect, 0.5 is a coin toss) of around 0.89, and several reduced the time patients spent hypotensive. The risk and decision-support tools posted comparable figures: MySurgeryRisk scored between 0.80 and 0.92 across outcomes, and POTTER outperformed surgeons, at 0.88 versus 0.84 for mortality and 0.93 versus 0.83 for ventilator dependence. As measures of discrimination, these are respectable numbers.
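For readers who want to see what an AUROC actually counts, here is a minimal sketch in Python. It uses the pairwise definition: the share of event/non-event pairs in which the event received the higher score, with ties counted as half. The function name `auroc` and the eight toy patients are invented for illustration. They are not data from the review or from any of the tools it covers.

```python
def auroc(labels, scores):
    """Probability that a randomly chosen event outscores a randomly
    chosen non-event; ties count as half a win."""
    events = [s for y, s in zip(labels, scores) if y == 1]
    non_events = [s for y, s in zip(labels, scores) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Hypothetical toy cohort: 1 = complication occurred, 0 = it did not.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.20, 0.90]

print(f"AUROC = {auroc(labels, scores):.4f}")
```

Note what the figure leaves out: it says how well the model ranks patients, not how many alarms it raises at a given threshold or how many of those alarms are true.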
But discrimination is not clinical usefulness, and the review is honest about the difference. MyRISK makes the point cleanly. Its sensitivity is 94 percent, so it misses almost no true case, and its negative predictive value is 99 percent. Its positive predictive value is 7 percent. At the prevalence of the events these models chase, that combination means the great majority of alarms are false: of every hundred warnings, ninety-three are wrong, so for each true one the surgeon dismisses about thirteen that were not. A number that looks commanding on a curve becomes, in the room, a noise the team learns to silence.
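The arithmetic behind that alarm burden is short enough to write down. The sketch below does two things. It converts a positive predictive value into false alarms per true alarm, and it uses Bayes' rule to show how the same sensitivity gives a very different PPV as the event becomes rarer. Only the 94 percent sensitivity and the 7 percent PPV come from the review. The 90 percent specificity and the prevalence values are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
def false_alarms_per_true_alarm(ppv):
    """Number of false alarms raised for every true one, given the PPV."""
    if not 0.0 < ppv <= 1.0:
        raise ValueError("ppv must be in (0, 1]")
    return (1.0 - ppv) / ppv


def ppv_from(sensitivity, specificity, prevalence):
    """Bayes' rule: share of positive alerts that are true events."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Reported PPV from the review: 7 percent.
ratio = false_alarms_per_true_alarm(0.07)
print(f"False alarms per true alarm at PPV 7%: {ratio:.1f}")

# Assumed specificity (0.90) and assumed prevalences, for illustration only.
for prevalence in (0.20, 0.05, 0.01):
    ppv = ppv_from(sensitivity=0.94, specificity=0.90, prevalence=prevalence)
    print(f"prevalence {prevalence:.0%} -> PPV {ppv:.1%}")
```

The design point is that PPV is not a property of the model alone. It is a property of the model in a particular population, which is one more reason a figure validated in one hospital may not carry over to the next.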
The absence at the centre
The review's principal finding is what is missing. External validation — testing a model on data from a hospital that did not build it — was infrequent, and no included study fully met the TRIPOD+AI standard for transparent reporting of a prediction model. Only two of the nineteen studies asked the clinicians who would use these tools what they made of them; both reported favourable impressions, which tells you the question is answerable, not that it has been answered. And the barriers the authors name are not computational. They are manual data entry that does not scale, no link to the electronic record, regulatory and ethical uncertainty that the authors call the single most common reason for non-adoption, and the absence of any reimbursement to pay for running the system. The distance between an AUROC and a better outcome is not closed by a better algorithm. It is closed by interfaces, external validation, training and a budget.
There is also a tilt in the evidence worth naming, because the authors name it. The reason hypotension prediction crowds out everything else is partly that one product is commercially available and well integrated — and most of the studies testing it were funded by its manufacturer. The competing-interests statement records the authors' own funding as a regional health-authority grant with no industry stake; the skew sits in the underlying literature, not in the review. But it means the field's most-studied success is also its most-sponsored one, and a reader should weigh the eleven HPI papers as a body of manufacturer-supported work rather than as eleven independent verdicts.
Why it matters here
Under the Medical Device Regulation (MDR) and the EU AI Act, software that guides surgical decisions sits in the high-risk class, with demanding requirements for conformity assessment, clinical evidence and post-market surveillance. Those requirements are hard for any team and close to impossible for an academic group with no commercial sponsor behind a promising model. That, rather than the mathematics, is the more likely reason the most accurate tools never leave the pilot. For anyone weighing surgical AI, the useful discipline is to read past the AUROC and ask the unglamorous questions: validated externally on whom, wired into which record, judged by which clinicians, and paid for how. Two studies in this review thought to ask the surgeons. That is the line to start from, not the accuracy table.
Source: Mevik K, Woldaregay AZ, Jonsson EL, Tejedor M, Temple-Oberle C. Application of AI Models for Preventing Surgical Complications: Scoping Review of Clinical Readiness and Barriers to Implementation. JMIR AI 2026;5:e75064. A scoping review — it maps and appraises the existing literature rather than pooling effects, so its central claim is about the state of the field, not the size of any single benefit.


