Journal Club · 25 May 2026 · 5 min read

Triage by Language Model: The Source Did the Work, Not the Model

A retrospective study grounded a triage language model in two sources at once — a local guideline and three thousand past cases. The honest finding is quieter: it beat an ungrounded model, the authors call it preclinical, and no patient was followed forward.

Dr. Sven Jungmann

CEO

The interesting line in this paper is not the one the press release would pull. The same language model, asked the same triage question, matched the expert nurses' category in barely more than half of cases on its own and in four out of five once it was handed two reference sources to reason over. Nothing about the model changed in between. That single contrast is what makes the study worth reading, because it points at where the value sits, and it is not in the model.

The work is a retrospective evaluation in JMIR Medical Informatics from a tertiary emergency department in Hong Kong. The system, MECR-RAG, is a retrieval-augmented language model: instead of answering from its own weights, it first fetches evidence and reasons over it. Its one distinctive move is to retrieve from two sources together — the local Hong Kong emergency triage guideline, and a database of three thousand anonymised past triage encounters. The underlying model was DeepSeek-V3 in every arm, accessed off the shelf with no fine-tuning, which is exactly why the comparison is clean.
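To make the two-source idea concrete, here is a minimal sketch of how a prompt might be assembled from a guideline and a case database retrieved separately. It is not the authors' implementation: the `Doc` type, the function names, the word-overlap ranking (a stand-in for a real retriever) and the sample texts are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    source: str  # "guideline" or "case" (hypothetical labels)
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word set; a toy stand-in for a real embedding model."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w}


def top_k(query: str, docs: list[Doc], k: int) -> list[Doc]:
    """Rank documents by word overlap with the query and keep the best k."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d.text)), reverse=True)
    return ranked[:k]


def build_prompt(presentation: str, guideline: list[Doc], cases: list[Doc], k: int = 2) -> str:
    """Retrieve from each source separately, then place both in one prompt."""
    lines = ["Guideline excerpts:"]
    lines += [f"- {d.text}" for d in top_k(presentation, guideline, k)]
    lines.append("Similar past cases:")
    lines += [f"- {d.text}" for d in top_k(presentation, cases, k)]
    lines.append(f"Patient: {presentation}")
    lines.append("Assign a triage category from 1 to 5.")
    return "\n".join(lines)


# Invented example data, not taken from the study.
GUIDELINE = [
    Doc("guideline", "Chest pain with sweating: category 2"),
    Doc("guideline", "Minor ankle sprain, walking: category 4"),
]
CASES = [
    Doc("case", "Male 60, chest pain radiating to arm, triaged category 2"),
    Doc("case", "Female 25, twisted ankle, triaged category 4"),
]

prompt = build_prompt("Male 58 with chest pain and sweating", GUIDELINE, CASES, k=1)
print(prompt)
```

In this sketch the model itself never appears: the prompt is the only thing that differs between a grounded and an ungrounded run, which is the property that makes a same-model comparison clean.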

The test

This is a single-centre, retrospective, in-silico study. The grounded system and a prompt-only version of the same model were each asked to assign a five-level triage category to 236 routine encounters drawn from the department's 2023 records; the case database the grounded system could retrieve from came from 2024. Reference labels came from blinded senior triage nurses. The two primary outcomes were the quadratic weighted kappa (an agreement statistic that penalises larger disagreements more than near misses) and plain accuracy against those expert labels. For the 226 encounters that had enough follow-up, the authors added a severity tier built from what actually happened to the patient, so they could ask the harder question: not whether the system matched the nurses, but whether it caught the people who turned out to be sick. They are unambiguous about the tier of evidence this represents: preclinical, in-silico validation under the DECIDE-AI framework, a proof of concept and nothing more.
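For readers who have not met the quadratic weighted kappa, a small sketch may help. It uses the standard textbook definition, in which a disagreement between categories i and j is weighted by (i − j)² / (k − 1)² for k levels. The function name and the toy label lists are invented here, and this is not the study's own analysis code.

```python
def quadratic_weighted_kappa(rater_a: list[int], rater_b: list[int], n_levels: int = 5) -> float:
    """Cohen's kappa with quadratic weights for ordinal labels 1..n_levels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)

    # observed[i][j] counts cases rated i+1 by rater A and j+1 by rater B.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under chance agreement, from the marginal totals.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected


# Toy labels, invented for illustration.
nurse = [1, 2, 3, 4, 5, 3]
near_miss = [1, 2, 3, 4, 5, 4]  # last case off by one level
far_miss = [1, 2, 3, 4, 5, 5]   # last case off by two levels

kappa_perfect = quadratic_weighted_kappa(nurse, nurse)
kappa_near = quadratic_weighted_kappa(nurse, near_miss)
kappa_far = quadratic_weighted_kappa(nurse, far_miss)
print(kappa_perfect, kappa_near, kappa_far)
```

Both imperfect raters in the toy example get five of six cases exactly right, so their plain accuracy is identical. The kappa is lower for the one whose single error is two levels away, which is why the study reports both statistics.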

Where the gain came from

Take the ablation first, because it is the result that should change how a reader thinks. Strip the system back to the bare model and agreement collapsed. Add only the guideline and it rose; add only the case database and it rose; only the two together reached the top. Neither source carried the system alone. And when the authors swapped the engine — an exploratory check with Claude 3.7 Sonnet and GPT-4o — the grounded versions landed in the same vicinity, while removing retrieval cost far more than changing models ever did. The lever here is the evidence the model is grounded in, not the size or brand of the model. That is the reverse of where most of the field's effort and budget currently point.

The headline numbers are real and they are good. Grounded, the system reached a quadratic weighted kappa of 0.902 against the nurse reference, compared with 0.801 for the same model without retrieval (P<.001); accuracy climbed from 0.542 to 0.802. Its agreement landed in the band of the nurses' agreement with each other (interrater kappa 0.887), which the authors call expert-comparable rather than superior. The clinically meaningful shift was in overtriage: unnecessary high-priority assignments fell from 68 of 236 cases to 30, while undertriage stayed low (4 to 3). And on the measure that matters most, flagging the patients whose course turned out to be severe, the grounded system caught 124 of 130 high-severity cases (95.4 percent) against 117 of 130 (90.0 percent) for the nurses' initial triage (P=.02), at comparable specificity.
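The counts in that paragraph are easier to compare as rates. The short calculation below only re-expresses the figures quoted above as percentages; the variable names are invented, and the rounding to one decimal place is a presentation choice rather than anything taken from the paper.

```python
N_CASES = 236   # encounters in the test set
N_SEVERE = 130  # encounters classed as high severity on follow-up


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


overtriage_ungrounded = pct(68, N_CASES)
overtriage_grounded = pct(30, N_CASES)
undertriage_ungrounded = pct(4, N_CASES)
undertriage_grounded = pct(3, N_CASES)

sensitivity_grounded = pct(124, N_SEVERE)
sensitivity_nurses = pct(117, N_SEVERE)

# Severe cases not flagged by each approach.
missed_grounded = N_SEVERE - 124
missed_nurses = N_SEVERE - 117

print(f"Overtriage: {overtriage_ungrounded}% -> {overtriage_grounded}%")
print(f"Undertriage: {undertriage_ungrounded}% -> {undertriage_grounded}%")
print(f"Sensitivity: nurses {sensitivity_nurses}%, grounded {sensitivity_grounded}%")
print(f"Missed severe cases: nurses {missed_nurses}, grounded {missed_grounded}")
```

Put this way, the sensitivity gap rests on seven patients out of 130, which is worth keeping in mind when reading the P value.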

“The same model, the same question: poor agreement alone, expert-comparable once grounded in a guideline and past cases. The lever was the evidence, not the model.”

What it does not show

Every one of those figures is agreement with a triage label, not a patient's path through the department. Nobody was randomised, no waiting time was shortened, no missed infarction was prevented here. The severity analysis reaches toward clinical relevance, but the authors are candid that severe events were few and that both nurse consensus and their composite outcome are pragmatic proxies, not ground truth. They go further than most would, naming the Rothman Index and the Epic Sepsis Model as tools that looked strong offline and performed materially worse once independently evaluated after deployment. Strong agreement on a retrospective bench is precisely the result that has faded before when the system met a live queue.

The setting narrows it again. One emergency department, one local guideline, one language and documentation style, 236 archived encounters worked through offline at roughly a minute each. Germany runs the Manchester Triage System, not Hong Kong's; a system tuned to one guideline and case mix shows the architecture can work, not that it transfers. The retrieval database was built automatically from routine notes without clinician adjudication, so some retrieved precedents may be mislabelled — and the test cases predate the case database by a year, a temporal mismatch the authors flag themselves. None of this is a defect. It is the honest boundary of a proof of concept, drawn by the authors before anyone else could.

Why it matters

For anyone weighing decision support in their own department, the lesson to take away is the ablation, not the kappa. If grounding in authoritative, local evidence moves the result more than the choice of model, then the work that decides whether such a tool is safe is unglamorous and clinical: which guideline, which past cases, curated and checked by whom, kept current by whom. That belongs to the people who run the triage desk, not to whoever supplies the model. And the only way to learn whether it helps is the way the authors insist on: prospectively, across more than one centre, measured against waiting times, crowding, length of stay and outcomes that actually reach the patient.

Source: Wong HS, Wong TK. Multi-Evidence Clinical Reasoning With Retrieval-Augmented Generation for Emergency Triage: Retrospective Evaluation Study. JMIR Medical Informatics 2026;14:e82026. A single-centre, retrospective, preclinical (in-silico) evaluation; its primary endpoint was agreement with expert triage labels, not any prospectively measured patient outcome.

Tags: Journal Club, Clinical AI, Emergency Medicine, Evidence-Based Medicine, Retrieval-Augmented Generation
