Journal Club · 5 min read

A Clinical AI That Knows When It Doesn't Know Enough

A hepatology decision-support system was built to stop answering when its evidence runs thin, and to flag the answers it gives anyway. The architecture is the interesting part. The evidence behind it is thirty questions, scored by its own makers.

Dr. Sven Jungmann


CEO


Three tries, then a confession. That is the behaviour at the centre of a new hepatology assistant described in Frontiers in Medicine: when the system cannot assemble enough grounded evidence to answer safely, it does not improvise a confident reply. It reformulates the query and searches again, to a ceiling of three passes, and if it still falls short it answers from the model's own memory and tells the user that it did. For anyone who has watched a chatbot respond to a clinical question with fluent, plausible, unsupported text, that small act of restraint is the headline, more than any score in the paper.

The failure mode the authors are designing against is the one every clinician has seen: the model that always answers. Ask it anything and it returns smooth prose, whether or not its knowledge supports a safe reply. A system that offers a plausible but unsafe answer with full confidence is worse than one that stops and says it does not yet have enough to go on. Building a clinical tool that prefers the second behaviour over the first is a genuinely good instinct, and most chatbots do the opposite by default.

The design choice that matters

The system has two moving parts. One is a domain knowledge graph — 12,192 entities and 28,770 relations — distilled from 53 clinical guidelines on liver disease drawn from PubMed over the previous fifteen years. The other is an agent that queries that graph in a loop the authors call retrieve-evaluate-refine: it retrieves, judges whether what it has is sufficient, and if not reformulates the query and tries again, to a ceiling of three passes. The part worth dwelling on is what happens when three passes fail. Rather than going silent, the system falls back to the language model's own parametric knowledge and attaches an explicit warning that the answer was not graph-verified. It separates a grounded answer from a fallback answer and shows the reader which one is on the screen.
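The control flow of that loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `answer_with_fallback`, the `Answer` record and the five callables passed in (`retrieve`, `is_sufficient`, `refine`, `generate_grounded`, `generate_from_memory`) are all hypothetical stand-ins for components the article describes only in prose. The one detail taken from the source is the ceiling of three passes and the explicit warning on the fallback path.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_PASSES = 3  # ceiling on retrieval passes, as reported in the article


@dataclass
class Answer:
    text: str
    grounded: bool      # True only if the answer rests on retrieved evidence
    passes_used: int
    warning: str = ""   # non-empty only on the fallback path


def answer_with_fallback(
    question: str,
    retrieve: Callable[[str], List[str]],             # hypothetical graph query
    is_sufficient: Callable[[str, List[str]], bool],  # hypothetical evidence judge
    refine: Callable[[str, List[str]], str],          # hypothetical query rewriter
    generate_grounded: Callable[[str, List[str]], str],
    generate_from_memory: Callable[[str], str],
) -> Answer:
    """Retrieve, evaluate, refine; fall back with a label if evidence stays thin."""
    query = question
    for attempt in range(1, MAX_PASSES + 1):
        evidence = retrieve(query)
        if is_sufficient(question, evidence):
            # Grounded path: the answer is built from the retrieved evidence.
            return Answer(generate_grounded(question, evidence), True, attempt)
        if attempt < MAX_PASSES:
            # Not enough yet: reformulate the query before the next pass.
            query = refine(query, evidence)
    # All passes failed: answer from the model alone and say so explicitly.
    return Answer(
        text=generate_from_memory(question),
        grounded=False,
        passes_used=MAX_PASSES,
        warning="Not graph-verified: answered from the model's own knowledge.",
    )
```

The design point is that the `grounded` flag and the `warning` travel with the answer rather than being logged somewhere out of sight, so whatever renders the reply can show the reader which kind of answer is on the screen.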

Underneath, this is Retrieval-Augmented Generation — the technique of grounding a model's output in retrieved documents rather than its training alone — with two refinements: a structured knowledge graph in place of loose text passages, and an agent allowed to notice its own ignorance and retry. Self-correcting is a fair label for it. Validated is not, and the rest of the paper is where that distinction has to be drawn carefully.
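To make "a structured knowledge graph in place of loose text passages" concrete, here is a toy sketch of retrieval over entity-relation triples. Everything in it is an assumption for illustration: the triples are placeholders rather than clinical facts or content from the paper, and `build_index` and `neighbourhood` are hypothetical helper names, not part of the published system.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Placeholder triples: NOT clinical facts and NOT taken from the paper.
TOY_TRIPLES: List[Triple] = [
    ("condition_A", "diagnosed_by", "test_P"),
    ("condition_A", "first_line_treatment", "drug_X"),
    ("drug_X", "contraindicated_in", "condition_B"),
]


def build_index(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Index triples by head entity so a lookup follows outgoing edges."""
    index: Dict[str, List[Triple]] = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((head, relation, tail))
    return index


def neighbourhood(
    index: Dict[str, List[Triple]], entity: str, hops: int = 2
) -> List[Triple]:
    """Collect the triples reachable within `hops` outgoing edges of `entity`."""
    seen = {entity}
    frontier = [entity]
    found: List[Triple] = []
    for _ in range(hops):
        next_frontier: List[str] = []
        for node in frontier:
            for triple in index.get(node, []):
                found.append(triple)
                tail = triple[2]
                if tail not in seen:
                    seen.add(tail)
                    next_frontier.append(tail)
        frontier = next_frontier
    return found
```

Calling `neighbourhood(build_index(TOY_TRIPLES), "condition_A", hops=2)` returns all three placeholder triples, including the contraindication that sits two edges away from the starting entity. That multi-hop link is the kind of connection a passage retriever can miss when the two statements live in different documents.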

Thirty questions, marked by the people who set them

The evaluation is where the work declares itself a proof of concept. Two hepatologists, blinded to which system wrote which answer, scored the outputs on five-point scales for accuracy, completeness and safety. The proposed framework led on all three, most clearly on safety, where it scored 4.9 against 4.4 for a graph-based baseline, 4.3 for standard retrieval and 4.1 for GPT-4 alone; it also posted strong automated RAGAS metrics — faithfulness 0.94, context recall 0.92, answer relevancy 0.91. Taken on its own terms the pattern is internally consistent: a system built to ground its answers and abstain when it cannot does produce answers two clinicians judged safer.

Then the denominator arrives. The whole evaluation rests on thirty questions — ten factual, ten multi-step, ten deliberately ambiguous. Thirty. At that size the gap between 4.4 and 4.9 is a direction, not a measured effect, and no confidence interval will turn it into one; the authors say as much, calling the dataset insufficient to establish broad statistical significance. Worse for inference, the two clinicians who scored the answers are the same pair who constructed the questions, the automated metrics run on that same thirty-item set, and the only comparators are GPT-4 and two retrieval variants — no external, state-of-the-art clinical question-answering system is in the room. A favourable result on your own benchmark, marked by your own examiners, is where inquiry begins, not where it ends.

Whose guidelines, whose medicine

A graph is only as trustworthy as the guidelines poured into it, and this one was built at the Anhui University of Chinese Medicine in Hefei from 53 documents the paper does not fully attribute to named national bodies. For a European reader that is not a footnote. A liver-disease assistant grounded in one jurisdiction's guidelines quietly inherits that jurisdiction's thresholds and drug choices; ground truth in clinical medicine is jurisdictional, and a system that looks authoritative can be authoritatively wrong for your patients. The authors flag the adjacent risk themselves — that any guideline-bound system trails the newest evidence, and that assembling the graph demanded heavy expert labour. These are the honest constraints of careful early work, declared without commercial conflict and funded by provincial research grants.


Why it matters

The transferable lesson is neither the hepatology graph nor the 4.9. It is a question to put to any clinical AI a hospital is shown: does it know when it does not know, and does it tell you? A model that abstains, retries, and labels an ungrounded answer as ungrounded is doing something an unguarded chatbot cannot — managing its own uncertainty in a way a clinician can audit and override. Whether this particular implementation deserves clinical trust is exactly what thirty self-graded questions cannot decide. But the property it reaches for is the one worth demanding, and worth measuring properly, long before anything like it comes near a patient.

Source: Hu Y, Xuan W, Zhou Q, Li Z, Li Y, Hu J, Fang F. A self-correcting Agentic Graph RAG for clinical decision support in hepatology. Frontiers in Medicine 2025;12:1716327. A peer-reviewed proof-of-concept evaluated on thirty questions scored by the developing team, with no external comparator — strong as a design, preliminary as evidence.

