Journal Club · 5 June 2026 · 5 min read

Surgical AI That Works in the Paper but Not in the Room

A scoping review screened 275 records to find every AI model meant to prevent surgical complications and follow it to the bedside. Across the 19 studies it included, the models were often accurate. Only two tools are in routine use, and the bottleneck is not the algorithm.

Dr. Sven Jungmann

CEO

Image: Editorial collage of a surgeon's gloved hands beside an anaesthesia monitor showing a teal arterial-pressure waveform, with a closed operating-room door suggested behind and a single amber accent.

Count the tools that have actually made it into a real operating room, and the number is two. A surgical risk calculator that has been around for a decade, and one commercial monitor that watches blood pressure. That is the working population of deployed artificial intelligence in this corner of surgery — not a list of nineteen, which is how many studies a new review managed to find, but two that a surgeon somewhere can use today without it being an experiment.

The review is by Mevik and colleagues in JMIR AI, and its question is deliberately narrow. Not whether AI can forecast a surgical complication — that case is largely made — but whether any of these forecasts has travelled the distance from a validation table to a patient on a list. They searched eleven databases, screened 275 records, and were left with 19 empirical studies of models used or trialled in real-world surgical settings, published between 2013 and the start of 2025. This is a scoping review: it maps a field rather than pooling an effect. For a question about the state of practice, that is the correct instrument, and it should be read as one.

An accurate field

On the narrow measure of technical accuracy, the literature looks healthy. The dominant subject is intra-operative hypotension (low blood pressure during surgery, associated with acute kidney injury, myocardial damage and longer intensive-care stays), and it dominates by some margin: eleven of the nineteen studies, about 58 percent, examined the Hypotension Prediction Index (HPI). Models forecasting hypotension reached an area under the receiver operating characteristic curve (AUROC, a measure of how well a model separates events from non-events; 1.0 is perfect, 0.5 is a coin toss) of around 0.89, and several reduced the time patients spent hypotensive. The risk and decision-support tools posted comparable figures: MySurgeryRisk between 0.80 and 0.92 across outcomes, and POTTER outperforming surgeons at 0.88 versus 0.84 for mortality and 0.93 versus 0.83 for ventilator dependence. As measures of discrimination, these are respectable numbers.
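For readers who want the AUROC made concrete, here is a minimal sketch in Python. It uses the pairwise reading of the metric: the share of (event, non-event) pairs in which the event received the higher score, with ties counted as half. The function name `auroc` and the risk scores are invented for illustration and are not taken from the review or from any of the tools it covers.

```python
from itertools import product


def auroc(event_scores, non_event_scores):
    """Share of (event, non-event) pairs where the event scored higher.

    Ties count as half a win. This is the pairwise (rank-based) reading
    of the AUROC; it is fine for small examples but slow for large data.
    """
    pairs = list(product(event_scores, non_event_scores))
    wins = sum(1.0 if e > n else 0.5 if e == n else 0.0 for e, n in pairs)
    return wins / len(pairs)


# Hypothetical model scores, invented for illustration only.
hypotensive = [0.9, 0.8, 0.6, 0.55]    # patients who did become hypotensive
stable = [0.7, 0.4, 0.3, 0.2, 0.1]     # patients who did not

print(auroc(hypotensive, stable))      # 18 of 20 pairs ranked correctly
```

Note what the figure leaves out: it says how well the model ranks patients, not how many alarms it raises at a given threshold or how many of those alarms are right.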

But discrimination is not clinical usefulness, and the review is honest about the difference. MyRISK makes the point cleanly. Its sensitivity is 94 percent, so it misses almost no true case, and its negative predictive value is 99 percent. Its positive predictive value is 7 percent. At the prevalence of the events these models chase, that combination means the great majority of alarms are false: for every true warning, the surgeon dismisses about thirteen that were not. A number that looks commanding on a curve becomes, in the room, a noise the team learns to silence.
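The arithmetic behind the false-alarm burden is short enough to check by hand, and the sketch below does so in Python. The first function uses only the 7 percent positive predictive value quoted above. The second applies Bayes' rule to show why PPV collapses when events are rare; the specificity and prevalence values passed to it are hypothetical, chosen only to illustrate the effect, and are not figures from the review.

```python
def false_alarms_per_true_alarm(ppv):
    """False alarms raised for each true alarm, given the PPV."""
    return (1.0 - ppv) / ppv


def ppv_from_bayes(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# PPV of 7 percent, as reported in the article.
print(f"{false_alarms_per_true_alarm(0.07):.1f}")  # 13.3

# Same 94 percent sensitivity at two prevalences.
# ASSUMPTION: specificity 0.90 and both prevalences are hypothetical.
for prevalence in (0.01, 0.20):
    ppv = ppv_from_bayes(0.94, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}")
```

The same sensitivity and specificity give very different PPVs as prevalence changes, which is why a model can look strong on a curve and still produce mostly false alarms for a rare complication.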

The absence at the centre

The review's principal finding is what is missing. External validation — testing a model on data from a hospital that did not build it — was infrequent, and no included study fully met the TRIPOD+AI standard for transparent reporting of a prediction model. Only two of the nineteen studies asked the clinicians who would use these tools what they made of them; both reported favourable impressions, which tells you the question is answerable, not that it has been answered. And the barriers the authors name are not computational. They are manual data entry that does not scale, no link to the electronic record, regulatory and ethical uncertainty that the authors call the single most common reason for non-adoption, and the absence of any reimbursement to pay for running the system. The distance between an AUROC and a better outcome is not closed by a better algorithm. It is closed by interfaces, external validation, training and a budget.

There is also a tilt in the evidence worth naming, because the authors name it. The reason hypotension prediction crowds out everything else is partly that one product is commercially available and well integrated — and most of the studies testing it were funded by its manufacturer. The competing-interests statement records the authors' own funding as a regional health-authority grant with no industry stake; the skew sits in the underlying literature, not in the review. But it means the field's most-studied success is also its most-sponsored one, and a reader should weigh the eleven HPI papers as a body of manufacturer-supported work rather than as eleven independent verdicts.


Why it matters here

Under the Medical Device Regulation (MDR) and the EU AI Act, software that guides surgical decisions sits in the high-risk class, with demanding requirements for conformity assessment, clinical evidence and post-market surveillance. Those requirements are hard for any team and close to impossible for an academic group with a promising model and no commercial sponsor behind it. That, more than any shortfall in the mathematics, is the likelier reason the most accurate tools never leave the pilot stage. For anyone weighing surgical AI, the useful discipline is to read past the AUROC and ask the unglamorous questions: validated externally on whom, wired into which record, judged by which clinicians, and paid for how. Two studies in this review thought to ask the surgeons. That is the line to start from, not the accuracy table.

Source: Mevik K, Woldaregay AZ, Jonsson EL, Tejedor M, Temple-Oberle C. Application of AI Models for Preventing Surgical Complications: Scoping Review of Clinical Readiness and Barriers to Implementation. JMIR AI 2026;5:e75064. A scoping review — it maps and appraises the existing literature rather than pooling effects, so its central claim is about the state of the field, not the size of any single benefit.

#Journal Club #Clinical AI #Surgery #Implementation Science #Evidence-Based Medicine



This analysis comes from the people behind Visite.
