Journal Club | 24 June 2026 | 5 min read

A Graph and a Search Engine Walk Into a Patient Record: Reading a Clinical AI Proof-of-Concept

A clinical AI retrieved every fact it was asked for, and two physicians flagged nothing unsafe. Both results are real. Both come from ten de-identified patients in a public research dataset, which is exactly what proof-of-concept means.

Dr. Sven Jungmann

CEO

One of the two physicians who reviewed this system gave it a 3.3 out of 5 for relevance and conciseness. That unglamorous number is the most useful line in the paper. The system found everything it was asked to find — and then kept talking. In a clinical summary, a machine that says too much is not merely tedious; it buries the one sentence a tired clinician needed to see.

The work is worth reading precisely because it is honest about its own scale. The claim that circulated alongside it — that an AI answered question after question without inventing anything — is true in a narrow, carefully bounded sense. The value of the paper is in marking exactly where that boundary sits.

The problem, and the design

A hospital record stores its truth in two incompatible forms. One is structured: coded diagnoses, lab values, medication lists — rows a computer can query. The other is prose: discharge letters, radiology reports, the registrar's note at three in the morning. Most retrieval tools read one world or the other, and the answer to a real clinical question usually straddles both.

The system, MediGRAF, runs two retrieval paths at once. Structured data sits in a Neo4j graph database, where a language model (GPT-4o-mini) translates a plain-language question into a Cypher query — Cypher being that database's query language. The free-text documents are turned into vector embeddings and searched by semantic similarity. The two result sets are merged, and the model then writes the final answer. Crucially, the test data were not live hospital records but ten de-identified patients drawn from MIMIC-IV, a widely used public research dataset: 25 discharge summaries and 64 radiology reports, rendered as a graph of roughly 5,970 nodes.
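The two-path design can be sketched in a few lines. This is an illustration only, not MediGRAF's code: the structured path is a plain filter over an in-memory list standing in for a generated Cypher query against Neo4j, and the free-text path uses bag-of-words cosine similarity standing in for vector embeddings. Every name here (`structured_search`, `semantic_search`, `hybrid_retrieve`) and every sample record is hypothetical.

```python
import math
import re
from collections import Counter

# Toy stand-ins for the two halves of a record (hypothetical data).
STRUCTURED = [
    {"patient": "p1", "kind": "diagnosis", "value": "atrial fibrillation"},
    {"patient": "p1", "kind": "medication", "value": "warfarin"},
    {"patient": "p2", "kind": "diagnosis", "value": "pneumonia"},
]
NOTES = [
    {"patient": "p1", "text": "Discharge summary: warfarin held after a fall with head injury."},
    {"patient": "p1", "text": "Chest radiograph: no acute cardiopulmonary process."},
    {"patient": "p2", "text": "Chest radiograph: right lower lobe consolidation."},
]


def tokens(text):
    """Lower-case word counts; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def structured_search(patient, kind):
    """Stand-in for the generated graph query: exact match on coded fields."""
    return [f"{r['kind']}: {r['value']}" for r in STRUCTURED
            if r["patient"] == patient and r["kind"] == kind]


def semantic_search(patient, question, top_k=1):
    """Stand-in for embedding search: rank the patient's notes by similarity."""
    q = tokens(question)
    scored = [(cosine(q, tokens(n["text"])), n["text"])
              for n in NOTES if n["patient"] == patient]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for score, text in scored[:top_k] if score > 0]


def hybrid_retrieve(patient, kind, question):
    """Merge both result sets, dropping duplicates but keeping order."""
    merged = structured_search(patient, kind) + semantic_search(patient, question)
    return list(dict.fromkeys(merged))


context = hybrid_retrieve("p1", "medication", "Why was warfarin held?")
```

Here `context` holds one coded fact and one passage of prose. In the system described above, the merged context would then go to the language model to write the answer; that step is omitted here.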

It is a proof-of-concept, and the authors say so plainly. No control arm of clinicians, no patient followed forward, no second site. The promise is deliberately small: that the architecture fetches the right information, and that on a handful of hard questions two doctors found nothing dangerous in the output.

What the evidence supports

Take the retrieval claim first, because it is the cleanest. For the simple and medium questions — the ones where a ground-truth set of records exists to check against — the hybrid system reached perfect recall (1.0): every relevant fact was pulled back. The instructive comparison is the graph query on its own, without the semantic search attached: it managed 0.8 recall on the simple questions and 0.688 on the medium ones, where its accuracy fell to 51.6 percent. The free-text path was not decoration; it closed the gap the structured query left open. That is the paper's genuine contribution — a specific demonstration that searching both worlds together beats searching the structured world alone.
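Recall is the share of ground-truth facts that the system actually retrieved. A minimal sketch of that arithmetic follows, using invented fact identifiers rather than the paper's data. It shows how a second retrieval path can lift recall from 0.8 to 1.0 by supplying the one fact the first path missed.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant items that appear among the retrieved ones."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without any relevant items")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical ground truth for one question: five facts a reviewer expects.
relevant = {"dx_af", "rx_warfarin", "lab_inr", "lab_hb", "note_ct_head"}

# Hypothetical outputs of each retrieval path.
graph_only = {"dx_af", "rx_warfarin", "lab_inr", "lab_hb"}
text_only = {"note_ct_head", "rx_warfarin"}
hybrid = graph_only | text_only

graph_recall = recall(graph_only, relevant)   # 4 of 5 facts
hybrid_recall = recall(hybrid, relevant)      # 5 of 5 facts
```

Recall does not penalise extra material, so a system can score 1.0 and still return more than the reader needs.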

The second finding is separate, and the paper keeps it separate even where the headline did not. Two hospital physicians scored the ten hardest, inference-style answers on five-point scales. Overall quality landed around 4.2 to 4.3 out of 5, and — the line that travelled — neither reviewer marked any of the ten as unsafe. For a technology whose signature failure is the confident fabrication, zero critical hallucinations in ten answers is worth recording.

What it does not support

These are two different measurements on two different question sets, and they answer two different questions. Perfect recall is a retrieval metric: the right facts were fetched. "No unsafe answers" is a clinical judgement by two readers on ten complex cases. Neither tells you what happens at the eleventh hard question or the two-hundredth patient. Ten patients and two raters is exactly the sample size on which a rare, dangerous error stays invisible — and rare-but-dangerous is the error that matters.
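The point about invisible rare errors can be made with simple arithmetic. Assume, as a simplification, that the ten answers behave like independent trials with a fixed chance of an unsafe output. Observing zero unsafe answers then still leaves a wide range of plausible error rates. The sketch below computes the exact one-sided 95% upper bound for zero events in n trials, alongside the familiar "rule of three" approximation (3/n). The function name is illustrative, and the calculation is not taken from the paper.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Largest event rate p still consistent with 0 events in n trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    return 1 - (1 - confidence) ** (1 / n)


exact_10 = zero_event_upper_bound(10)     # about 0.26
rule_of_three_10 = 3 / 10                 # 0.30, a rough approximation at small n
exact_200 = zero_event_upper_bound(200)   # about 0.015
```

On these assumptions, ten clean answers are compatible with an unsafe-answer rate as high as roughly one in four. About 200 clean answers would be needed to bring the bound down to around 1.5 percent.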

The dataset narrows the claim further. MIMIC-IV is clean, curated, single-institution research material; a working record is messier, multilingual across much of Europe, thick with the abbreviations and contradictions that break retrieval. And the verbosity that earned that 3.3 is not a cosmetic flaw — the authors name it as the main thing needing work. They concede a second one: the system cannot yet say cleanly which fact came from the graph and which from the free-text search, so a clinician cannot fully trace an answer back to its source. For software meant to be checked at the bedside, that traceability is not a refinement to add later. It is the feature.

“Perfect recall and no unsafe answers are real results — on ten patients, from a public research dataset, with a machine that talks too much. That is exactly what a proof-of-concept is allowed to claim, and not a step more.”

Why it matters here

The underlying problem is one every European clinician knows: the answer is somewhere in the record, split between a coded field and a paragraph of prose, and finding it costs time nobody has. A method that searches both reliably is a sensible direction, and a clean comparison against a graph-only baseline is the kind of evidence the field is short of. The honest reading is that this earns a next step, not a deployment — a larger and messier dataset, more raters, real records in the languages they are written in, and source attribution good enough that a doctor can check the machine rather than trust it. The authors say as much themselves. Showing that something can be built is not the same as showing it is safe to use, and they are admirably clear about which one they have done.

Source: Thio S, Lewis M, Denaxas S, Dobson RJB. Unlocking electronic health records: a hybrid graph RAG approach to safe clinical AI for patient QA. Frontiers in Digital Health 2026;8:1780700. A single-team proof-of-concept on ten de-identified patients from a public research dataset, scored by two clinicians; it measures retrieval and reviewer-rated safety, not clinical outcomes. One author is employed by CogStack Limited; the remaining authors declare no competing interests.

Tags: Journal Club, Clinical AI, Retrieval-Augmented Generation, Electronic Health Records, Evidence-Based Medicine
