Journal Club · 5 min read

Triage by Language Model: The Source Did the Work, Not the Model

A retrospective study grounded a triage language model in two sources at once — a local guideline and three thousand past cases. The honest finding is quieter: it beat an ungrounded model, the authors call it preclinical, and no patient was followed forward.

Dr. Sven Jungmann

CEO

The interesting line in this paper is not the one the press release would pull. The same language model, asked the same triage question, matched expert nurses on barely half of cases on its own (accuracy 0.542) and on four in five (0.802) once it was handed two reference sources to reason over. Nothing about the model changed in between. That single contrast is the reason to read the study, because it points at where the value sits, and that is not in the model.

The work is a retrospective evaluation in JMIR Medical Informatics from a tertiary emergency department in Hong Kong. The system, MECR-RAG, is a retrieval-augmented language model: instead of answering from its own weights, it first fetches evidence and reasons over it. Its one distinctive move is to retrieve from two sources together — the local Hong Kong emergency triage guideline, and a database of three thousand anonymised past triage encounters. The underlying model was DeepSeek-V3 in every arm, accessed off the shelf with no fine-tuning, which is exactly why the comparison is clean.
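To make the two-source idea concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a toy word-overlap score in place of whatever retriever MECR-RAG actually uses, and every name in it (`Doc`, `overlap`, `retrieve`, `build_prompt`) and every example string is invented for illustration. None of the text comes from the Hong Kong guideline or the study's case database.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    source: str  # hypothetical tag: "guideline" or "case"
    text: str


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, docs: list[Doc], k: int) -> list[Doc]:
    """Return the k documents with the highest toy overlap score."""
    return sorted(docs, key=lambda d: overlap(query, d.text), reverse=True)[:k]


def build_prompt(complaint: str, guideline: list[Doc], cases: list[Doc], k: int = 1) -> str:
    """Retrieve from BOTH sources and place the evidence ahead of the question."""
    lines = ["Guideline excerpts:"]
    lines += [f"- {d.text}" for d in retrieve(complaint, guideline, k)]
    lines += ["Similar past cases:"]
    lines += [f"- {d.text}" for d in retrieve(complaint, cases, k)]
    lines += [f"New patient: {complaint}", "Assign a triage category from 1 to 5."]
    return "\n".join(lines)


# Invented toy data, for illustration only.
guideline = [
    Doc("guideline", "chest pain with sweating is category 2"),
    Doc("guideline", "minor ankle sprain is category 4"),
]
cases = [
    Doc("case", "ankle sprain after football, triaged category 4"),
    Doc("case", "chest pain radiating to arm, triaged category 2"),
]
prompt = build_prompt("crushing chest pain and sweating", guideline, cases)
print(prompt)
```

The resulting prompt would then be sent to whichever model is in use. The point of the sketch is only that the model call stays the same while the evidence placed in front of it changes.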

The test

This is a single-centre, retrospective, in-silico study. The grounded system and a prompt-only version of the same model were each asked to assign a five-level triage category to 236 routine encounters drawn from the department's 2023 records; the case database that the grounded system retrieved from came from 2024. Reference labels came from blinded senior triage nurses. The two primary outcomes were the quadratic weighted kappa, an agreement statistic that penalises larger disagreements more than near misses, and plain accuracy against those expert labels. For the 226 encounters that had enough follow-up, the authors added a severity tier built from what actually happened to the patient, so they could ask the harder question: not whether the system matched the nurses, but whether it caught the people who turned out to be sick. They are unambiguous about the tier of evidence this represents: preclinical, in-silico validation under the DECIDE-AI framework, a proof of concept and nothing more.
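For readers who have not met the statistic, the following is a minimal sketch of a quadratic weighted kappa for five ordinal categories, written in plain Python under the standard textbook definition. It is not the authors' analysis code, and the function name and toy label lists are assumptions made for illustration. scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes the same quantity.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories=5):
    """Cohen's kappa with quadratic weights for ordinal labels 1..n_categories."""
    n = len(rater_a)
    k = n_categories

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: a disagreement of two levels costs four times
            # as much as a disagreement of one level.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = row_totals[i] * col_totals[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used a single identical category; treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy labels, invented for illustration (not study data).
reference = [1, 2, 3, 4, 5]
near_miss = [1, 2, 3, 4, 4]  # last case is off by one level
far_miss = [1, 2, 3, 4, 1]   # last case is off by four levels

print(quadratic_weighted_kappa(reference, reference))
print(quadratic_weighted_kappa(reference, near_miss))
print(quadratic_weighted_kappa(reference, far_miss))
```

Both toy raters get four of five cases right, so plain accuracy cannot tell them apart. The weighted kappa can, which is why the study reports it alongside accuracy.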

Where the gain came from

Take the ablation first, because it is the result that should change how a reader thinks. Strip the system back to the bare model and agreement collapsed. Add only the guideline and it rose; add only the case database and it rose; only the two together reached the top. Neither source carried the system alone. And when the authors swapped the engine — an exploratory check with Claude 3.7 Sonnet and GPT-4o — the grounded versions landed in the same vicinity, while removing retrieval cost far more than changing models ever did. The lever here is the evidence the model is grounded in, not the size or brand of the model. That is the reverse of where most of the field's effort and budget currently point.

The headline numbers are real and they are good. Grounded, the system reached a quadratic weighted kappa of 0.902 against the nurse reference, compared with 0.801 for the same model without retrieval (P<.001); accuracy climbed from 0.542 to 0.802. Its agreement landed in the band of the nurses' agreement with each other (interrater kappa 0.887), which the authors call expert-comparable rather than superior. The clinically meaningful shift was in overtriage: unnecessary high-priority assignments fell from 68 of 236 cases to 30, while undertriage stayed low (from 4 cases to 3). And on the measure that matters most, flagging the patients whose course turned out to be severe, the grounded system caught 124 of 130 high-severity cases (95.4 percent) compared with 117 of 130 (90.0 percent) for the nurses' initial triage (P=.02), at comparable specificity.
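The arithmetic behind those counts is easy to check. The short sketch below only recomputes rates from the counts quoted above; the variable names and the `pct` helper are assumptions for illustration, and the overtriage percentages are derived here rather than quoted from the paper.

```python
# Counts as quoted in the paragraph above.
SEVERE_TOTAL = 130      # high-severity cases in the outcome analysis
CAUGHT_GROUNDED = 124   # flagged by the grounded system
CAUGHT_NURSES = 117     # flagged by the nurses' initial triage

ENCOUNTERS = 236
OVERTRIAGE_UNGROUNDED = 68
OVERTRIAGE_GROUNDED = 30


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


sensitivity_grounded = pct(CAUGHT_GROUNDED, SEVERE_TOTAL)
sensitivity_nurses = pct(CAUGHT_NURSES, SEVERE_TOTAL)
missed_grounded = SEVERE_TOTAL - CAUGHT_GROUNDED
missed_nurses = SEVERE_TOTAL - CAUGHT_NURSES

overtriage_rate_ungrounded = pct(OVERTRIAGE_UNGROUNDED, ENCOUNTERS)
overtriage_rate_grounded = pct(OVERTRIAGE_GROUNDED, ENCOUNTERS)

print(f"Sensitivity: {sensitivity_grounded}% grounded vs {sensitivity_nurses}% nurses")
print(f"Severe cases missed: {missed_grounded} grounded vs {missed_nurses} nurses")
print(f"Overtriage: {overtriage_rate_ungrounded}% ungrounded vs {overtriage_rate_grounded}% grounded")
```

Read as counts rather than percentages, the sensitivity gap is 6 missed severe cases against 13, out of 130. That is a useful reminder of how few events the comparison rests on.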

The same model, the same question: poor agreement alone, expert-comparable once grounded in a guideline and past cases. The lever was the evidence, not the model.

What it does not show

Every one of those figures is agreement with a triage label, not a patient's path through the department. Nobody was randomised, no waiting time was shortened, no missed infarction was prevented here. The severity analysis reaches toward clinical relevance, but the authors are candid that severe events were few and that both nurse consensus and their composite outcome are pragmatic proxies, not ground truth. They go further than most would, naming the Rothman Index and the Epic Sepsis Model as tools that looked strong offline and performed materially worse once independently evaluated after deployment. Strong agreement on a retrospective bench is precisely the result that has faded before when the system met a live queue.

The setting narrows it again. One emergency department, one local guideline, one language and documentation style, 236 archived encounters worked through offline at roughly a minute each. German departments commonly work with the Manchester Triage System, not Hong Kong's guideline; a system tuned to one guideline and case mix shows the architecture can work, not that it transfers. The retrieval database was built automatically from routine notes without clinician adjudication, so some retrieved precedents may be mislabelled, and the test cases predate the case database by a year, a temporal mismatch the authors flag themselves. None of this is a defect. It is the honest boundary of a proof of concept, drawn by the authors before anyone else could.

Why it matters

For anyone weighing decision support in their own department, the lesson to take away is the ablation, not the kappa. If grounding in authoritative, local evidence moves the result more than the choice of model, then the work that decides whether such a tool is safe is unglamorous and clinical: which guideline, which past cases, curated and checked by whom, kept current by whom. That belongs to the people who run the triage desk, not to whoever supplies the model. And the only way to learn whether it helps is the way the authors insist on: prospectively, across more than one centre, measured against waiting times, crowding, length of stay and outcomes that actually reach the patient.

Source: Wong HS, Wong TK. Multi-Evidence Clinical Reasoning With Retrieval-Augmented Generation for Emergency Triage: Retrospective Evaluation Study. JMIR Medical Informatics 2026;14:e82026. A single-centre, retrospective, preclinical (in-silico) evaluation; its primary endpoint was agreement with expert triage labels, not any prospectively measured patient outcome.
