Journal Club · 5 min read

Surgical AI That Works in the Paper but Not in the Room

A scoping review screened 275 records to find every AI model meant to prevent surgical complications and follow it to the bedside. Across the 19 studies it included, the models were often accurate. Only two tools are in routine use, and the bottleneck is not the algorithm.

Dr. Sven Jungmann

CEO

Count the tools that have actually made it into a real operating room, and the number is two. A surgical risk calculator that has been around for a decade, and one commercial monitor that watches blood pressure. That is the working population of deployed artificial intelligence in this corner of surgery — not a list of nineteen, which is how many studies a new review managed to find, but two that a surgeon somewhere can use today without it being an experiment.

The review is by Mevik and colleagues in JMIR AI, and its question is deliberately narrow. Not whether AI can forecast a surgical complication — that case is largely made — but whether any of these forecasts has travelled the distance from a validation table to a patient on a list. They searched eleven databases, screened 275 records, and were left with 19 empirical studies of models used or trialled in real-world surgical settings, published between 2013 and the start of 2025. This is a scoping review: it maps a field rather than pooling an effect. For a question about the state of practice, that is the correct instrument, and it should be read as one.

An accurate field

On the narrow measure of technical accuracy, the literature looks healthy. The dominant subject is intra-operative hypotension, a low blood pressure during surgery associated with acute kidney injury, myocardial damage and longer intensive-care stays, and it dominates by some margin: eleven of the nineteen studies, more than half, examined the Hypotension Prediction Index. Models forecasting hypotension reached an area under the receiver operating characteristic curve (AUROC, a measure of how well a model separates events from non-events; 1.0 is perfect, 0.5 is a coin toss) of around 0.89, and several reduced the time patients spent hypotensive. The risk and decision-support tools posted comparable figures: MySurgeryRisk between 0.80 and 0.92 across outcomes, and POTTER outperforming surgeons at 0.88 versus 0.84 for mortality and 0.93 versus 0.83 for ventilator dependence. As discrimination, these are respectable numbers.
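
For readers who want the AUROC definition made concrete, here is a minimal sketch in Python. It uses the pairwise interpretation: the share of (event, non-event) pairs in which the model scores the event higher, with ties counted as half. The function name and the toy scores are illustrative assumptions, not anything taken from the review or from the models it covers.

```python
from itertools import product


def auroc(event_scores, non_event_scores):
    """Pairwise AUROC: probability that a randomly chosen event
    receives a higher score than a randomly chosen non-event.
    Ties contribute 0.5. Inputs are hypothetical model scores."""
    if not event_scores or not non_event_scores:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for e, n in product(event_scores, non_event_scores):
        if e > n:
            wins += 1.0
        elif e == n:
            wins += 0.5
    return wins / (len(event_scores) * len(non_event_scores))


# Toy scores, invented purely for illustration.
perfect = auroc([0.9, 0.8], [0.1, 0.2])   # every event outranks every non-event
coin_toss = auroc([0.5], [0.5])           # a tie carries no information
mixed = auroc([0.9, 0.4], [0.5, 0.1])     # three of four pairs ranked correctly
```

The quadratic loop is fine for a demonstration; on real cohorts a rank-based routine from a statistics library would be the usual choice. Note what the measure leaves out: it says nothing about how common the event is, which is where the next paragraph picks up.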

But discrimination is not clinical usefulness, and the review is honest about the difference. MyRISK makes the point cleanly. Its sensitivity is 94 percent, so it misses almost no true case, and its negative predictive value is 99 percent. Its positive predictive value is 7 percent. At the low prevalence of the events these models chase, that combination means the great majority of alarms are false: for every true warning, the surgeon dismisses about thirteen that were not. A number that looks commanding on a curve becomes, in the room, a noise the team learns to silence.
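
The arithmetic behind that alarm burden can be sketched in a few lines of Python. The first function turns a positive predictive value into false alarms per true alarm; the second applies Bayes' rule to show why a sensitive model still produces a low PPV when events are rare. Only the 7 percent PPV and the 94 percent sensitivity come from the article. The specificity and prevalence values below are hypothetical, chosen for illustration, and are not reported figures for MyRISK.

```python
def false_alarms_per_true_alarm(ppv: float) -> float:
    """Number of false alarms for each true alarm at a given PPV."""
    if not 0.0 < ppv <= 1.0:
        raise ValueError("ppv must be in (0, 1]")
    return (1.0 - ppv) / ppv


def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Bayes' rule: PPV = true positives / all positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# PPV of 7 percent, as quoted in the article: roughly 13 false alarms per true one.
burden = round(false_alarms_per_true_alarm(0.07), 1)

# ASSUMED specificity (0.80) and prevalences (2% and 20%), for illustration only.
rare = ppv_from_rates(sensitivity=0.94, specificity=0.80, prevalence=0.02)
common = ppv_from_rates(sensitivity=0.94, specificity=0.80, prevalence=0.20)
```

Under these assumed inputs the same model yields a PPV of about 9 percent when the event affects 2 percent of patients and about 54 percent when it affects 20 percent. The model has not changed; the population has, which is why a discrimination figure alone cannot tell a team how many alarms it will be asked to ignore.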

The absence at the centre

The review's principal finding is what is missing. External validation — testing a model on data from a hospital that did not build it — was infrequent, and no included study fully met the TRIPOD+AI standard for transparent reporting of a prediction model. Only two of the nineteen studies asked the clinicians who would use these tools what they made of them; both reported favourable impressions, which tells you the question is answerable, not that it has been answered. And the barriers the authors name are not computational. They are manual data entry that does not scale, no link to the electronic record, regulatory and ethical uncertainty that the authors call the single most common reason for non-adoption, and the absence of any reimbursement to pay for running the system. The distance between an AUROC and a better outcome is not closed by a better algorithm. It is closed by interfaces, external validation, training and a budget.

There is also a tilt in the evidence worth naming, because the authors name it. The reason hypotension prediction crowds out everything else is partly that one product is commercially available and well integrated — and most of the studies testing it were funded by its manufacturer. The competing-interests statement records the authors' own funding as a regional health-authority grant with no industry stake; the skew sits in the underlying literature, not in the review. But it means the field's most-studied success is also its most-sponsored one, and a reader should weigh the eleven HPI papers as a body of manufacturer-supported work rather than as eleven independent verdicts.

Why it matters here

Under the Medical Device Regulation (MDR) and the EU AI Act, software that guides surgical decisions sits in the high-risk class, with demanding requirements for conformity assessment, clinical evidence and post-market surveillance. Those requirements are hard for any team and close to impossible for an academic group with no commercial sponsor behind a promising model. That burden, rather than any shortfall in the mathematics, is the more likely reason the most accurate tools never leave the pilot stage. For anyone weighing surgical AI, the useful discipline is to read past the AUROC and ask the unglamorous questions: validated externally on whom, wired into which record, judged by which clinicians, and paid for how. Two studies in this review thought to ask the surgeons. That is the line to start from, not the accuracy table.

Source: Mevik K, Woldaregay AZ, Jonsson EL, Tejedor M, Temple-Oberle C. Application of AI Models for Preventing Surgical Complications: Scoping Review of Clinical Readiness and Barriers to Implementation. JMIR AI 2026;5:e75064. A scoping review — it maps and appraises the existing literature rather than pooling effects, so its central claim is about the state of the field, not the size of any single benefit.
