Journal Club · 31 May 2026 · 5 min read

The Governance Gap: Why Clinical AI Fails After It Passes Validation

A clinical model clears validation, goes live, and slowly drifts — and no one is assigned to watch. A narrative review maps why oversight, not algorithms, is now the binding constraint on healthcare AI. Read for what a review can and cannot prove.

Dr. Sven Jungmann

CEO

[Image: Editorial collage of a clinician's still hands on a keyboard beneath a teal performance line drifting downward off a navy block, with a single amber accent marking the unnoticed dip.]

Ask a hospital who owns the sepsis algorithm running in its electronic record, and you will often get a pause. Somebody procured it. Somebody switched it on. But the person whose job is to notice when it stops working as advertised frequently does not exist. That missing role — not the model's architecture — is the subject of a recent narrative review in the MDPI journal Sci, and it is the more dangerous of the two problems.

The case that makes the danger concrete is well documented. In 2021 a team at Michigan Medicine externally validated the Epic Sepsis Model, a warning tool embedded in the dominant US electronic record and already live at hundreds of hospitals. Switched on at the bedside, it had never been tested against the patients it would actually score. When the team finally checked, its discrimination was poor: an area under the receiver operating characteristic curve (AUROC, a measure of how well a model separates cases from non-cases, where 0.5 is chance and 1.0 is perfect) of 0.63, with a sensitivity of roughly one in three. The model had not failed in the laboratory. It had simply never been governed once it left it.
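For readers who want to see what those two numbers measure, here is a minimal sketch in plain Python. The function names, the toy scores and the 0.6 alert threshold are illustrative assumptions; none of the data comes from Wong et al. or from the review. AUROC is computed as the share of case/non-case pairs in which the case receives the higher score, and sensitivity as the share of true cases that reach the alert threshold.

```python
def auroc(scores, labels):
    """Chance that a random case outscores a random non-case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity(scores, labels, threshold):
    """Share of true cases whose score reaches the alert threshold."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not case_scores:
        raise ValueError("need at least one case")
    flagged = sum(1 for s in case_scores if s >= threshold)
    return flagged / len(case_scores)


# Hypothetical toy data: three patients with the condition, three without.
toy_scores = [0.9, 0.4, 0.3, 0.8, 0.5, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]

print(auroc(toy_scores, toy_labels))                        # 5 of 9 pairs ranked correctly
print(sensitivity(toy_scores, toy_labels, threshold=0.6))   # 1 of 3 cases flagged
```

In this toy set the model ranks barely better than a coin flip and alerts on one case in three. The point is that both figures can only be computed against local patients with known outcomes, which is the check the deployed model had never received.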

What the authors set out to do

Bailo and five colleagues survey the published literature and the regulatory frameworks on how artificial intelligence is governed once it reaches real clinical environments. They searched five databases — Scopus, Web of Science, PubMed/MEDLINE, Embase and IEEE Xplore — for English-language work from 2018 to late 2025, screened in pairs, and added government and regulator documents by hand. Crucially, they applied no meta-analysis and no formal risk-of-bias assessment, and they say so. This is a narrative review with a described search, not a systematic review, and the distinction governs how much weight its conclusions can bear.

Why labour the point? Because a narrative review generates no primary data. It is an expert reading of a chosen body of work, organised into themes, and the choice of what to include and how to frame it rests on judgement. That is not a defect — a good narrative review is a map of confusing terrain, and this is a useful map drawn by people with no external funding and no declared conflicts of interest. But a map is not the terrain. Nothing here is evidence in the sense a clinician means at the bedside; it is a structured argument about where the evidence, and the accountability, have gone missing.

One thread runs through seven themes

The review sorts the governance problem into seven strands: bias and fairness, explainability, safety and quality, privacy and data protection, accountability and liability, human oversight, and procurement and deployment. Read across them and a single thread pulls tight. The hard problems are rarely technical limits on what can be built; they are organisational gaps in who answers for the system after it is built. Bias persists not mainly because subgroup performance cannot be measured but because no one is assigned to re-measure it once a model is live. Explainability tools multiply, yet being legible enough for a clinician to act on is not the same as being legible enough to hold someone to account when the model errs. And safety, the authors stress, is not a certificate stamped at approval; it is a property that has to be maintained, because models decay as the patients, the coding practices and the care pathways around them shift.
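Re-measuring subgroup performance is not technically demanding, which is the review's point. As a rough sketch only, assuming a hospital can export one row per scored patient with a subgroup label, whether the model alerted and whether the condition was later confirmed, a recurring audit could look like this. The record layout, the subgroup names and the 0.5 floor are hypothetical, not taken from the review.

```python
from collections import defaultdict


def subgroup_sensitivity(records):
    """records: iterable of (subgroup, alerted, had_condition) tuples.

    Returns {subgroup: sensitivity} for every subgroup with at least one true case.
    """
    cases = defaultdict(int)
    caught = defaultdict(int)
    for subgroup, alerted, had_condition in records:
        if had_condition:
            cases[subgroup] += 1
            if alerted:
                caught[subgroup] += 1
    return {group: caught[group] / cases[group] for group in cases}


def flag_subgroups(records, floor):
    """Names of subgroups whose sensitivity falls below the agreed floor, sorted."""
    per_group = subgroup_sensitivity(records)
    return sorted(group for group, value in per_group.items() if value < floor)


# Hypothetical audit extract: (subgroup, alerted, had_condition)
toy_records = [
    ("under_65", True, True), ("under_65", True, True), ("under_65", False, True),
    ("under_65", False, False),
    ("65_plus", True, True), ("65_plus", False, True), ("65_plus", False, True),
    ("65_plus", True, False),
]

print(subgroup_sensitivity(toy_records))
print(flag_subgroups(toy_records, floor=0.5))
```

The code is the easy part. What it cannot supply is the person who runs it on a schedule, decides what floor is acceptable, and acts when a subgroup is flagged.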

Their strongest move is to put post-deployment surveillance at the centre. The medical-device tradition treats approval as the milestone. Software that learns, or that a vendor updates silently, breaks that assumption: a model can be calibrated on Monday and miscalibrated by Friday, and no one-off validation will catch the change. The Michigan sepsis case exposes the same blind spot, since performance in use went unchecked until an independent team looked, and it is what European instruments such as the Medical Device Regulation (MDR) and the EU AI Act are still learning to turn into routine practice.
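One simple form such surveillance could take is a recurring check of calibration-in-the-large: compare the number of events the model predicted with the number that actually occurred in each monitoring window. The sketch below is an illustration under stated assumptions, not a method described in the review. The window names, the toy risks and the 20% tolerance are all hypothetical, and a real programme would also need adequate sample sizes per window.

```python
def observed_expected_ratio(predicted_risks, outcomes):
    """Observed events divided by the sum of predicted risks (1.0 means well calibrated overall)."""
    expected = sum(predicted_risks)
    if expected == 0:
        raise ValueError("sum of predicted risks is zero")
    return sum(outcomes) / expected


def drifting_windows(windows, tolerance=0.2):
    """windows: {name: (predicted_risks, outcomes)}.

    Returns the names of windows whose observed/expected ratio deviates
    from 1.0 by more than the tolerance, in the order given.
    """
    flagged = []
    for name, (risks, outcomes) in windows.items():
        ratio = observed_expected_ratio(risks, outcomes)
        if abs(ratio - 1.0) > tolerance:
            flagged.append(name)
    return flagged


# Hypothetical data: same predicted risks, but fewer events occur in week 2.
toy_windows = {
    "week_1": ([0.2, 0.3, 0.5, 0.5, 0.5], [0, 0, 1, 0, 1]),  # expected 2.0, observed 2
    "week_2": ([0.2, 0.3, 0.5, 0.5, 0.5], [0, 0, 0, 0, 1]),  # expected 2.0, observed 1
}

print(drifting_windows(toy_windows))
```

A check like this only has value if its output lands on someone's desk with a defined escalation path, which is the organisational gap the review describes.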


Where the argument stops

Because it synthesises rather than measures, the review cannot tell you how common any of this is. It does not quantify how many deployed models are drifting, or how many hospitals have named an owner for that risk; it assembles illustrative cases — the Epic model among them, which it cites qualitatively rather than with its own figures — and reasons from them. Its recommendations are sound and widely shared: oversight needs named roles, defined escalation paths and scheduled subgroup audits. But these are proposals, not findings tested against outcomes. No study here shows that a hospital adopting them ends up with fewer harmed patients than one that does not. That trial has not been run.

There is also the interpretive limit common to any review of this kind: the choice of literature and the framing of the seven themes reflect the authors' reading, and another team might weight privacy above fairness, or liability above oversight, and tell a coherent but different story. The value lies in the framing, not in any claim to completeness — and the authors are candid that completeness was never the design.

What a European hospital should take from it

The lesson is not a new checklist; it is a question of ownership. The Michigan case is sobering precisely because the model was unremarkable — a default feature of a widely used record system, trusted because it shipped. The review's real contribution is to name the gap without flinching: the binding constraint on safe clinical AI is no longer the quality of the algorithm but whether someone, by name, is responsible for watching it after it goes live. That is an institutional decision, not a technical one, and most systems have yet to make it.

Source: Bailo P, Nittari G, Pesel G, Basello E, Spasari T, Ricci G. Governing Healthcare AI in the Real World: How Fairness, Transparency, and Human Oversight Can Coexist: A Narrative Review. Sci 2026;8(2):36. A narrative review — an expert synthesis of the literature with no meta-analysis and no formal risk-of-bias assessment, declaring no external funding and no conflicts of interest; its conclusions are a well-reasoned map of the governance gap, not measured evidence of its size. The Epic sepsis figures are drawn from the primary source, Wong et al., JAMA Internal Medicine 2021, not from this review.

#Journal Club #AI Governance #Patient Safety #Health Policy #Evidence-Based Medicine
