Passing the Exam Is Not the Same as Working the Ward
A systematic review of 39 medical AI benchmarks finds the same pattern everywhere: models that score 84-90 percent on licensing-style exams fall to 45-69 percent on tasks that resemble clinical work — and to 40-50 percent on safety. The gap is structural, not a fluke.

Dr. Sven Jungmann
CEO

Take a fixed clinical domain, keep the medical content constant, and change only the answer format — from multiple choice to free text. Accuracy falls by 30 to 40 percentage points. Nothing about the underlying knowledge has changed; what has changed is whether the candidate answers are sitting on the page waiting to be eliminated. That single observation, drawn from a systematic review in the Journal of Medical Internet Research, is the cleanest evidence yet that a large share of what we read as medical competence in a language model is the format flattering the score.
It also reframes a comparison that keeps getting told wrong. Yes, a top model answers United States licensing questions at around 96 percent — OpenAI's o1-preview on the MedQA set. And yes, the strongest reasoning models manage only about 46 percent on DiagnosisArena, a benchmark of hard published diagnostic cases on which practising physicians average roughly 20 percent. Those are two different instruments, not one system being unmasked. The review's contribution is to stop treating either number as a verdict and to measure the distance between the kind of test each represents.
What the review actually did
Gong, Bang, Lee and Baik at Hallym University screened 3,917 records and analysed 39 medical benchmarks for large language models — together more than 2.3 million questions spanning 45 languages, 172 specialties and 22 countries across six continents. It is a narrative synthesis rather than a meta-analysis, because the metrics were too heterogeneous to pool into a single effect size; the authors say so plainly. What they could do, and did, was sort the benchmarks by what they are really testing, and the sorting is where the result lives.
Twenty-one of the thirty-nine are knowledge-based: recall and reasoning in the register of a licensing exam, covering pharmacology, pathophysiology, guideline facts, and multiple-choice items with one defensible answer. Leading models score 84 to 90 percent on these, at or above average physician performance. Fifteen are practice-based, built to approximate clinical work: multi-turn diagnostic conversations, navigating a structured patient record, making decisions where the question is never posed cleanly. There, success rates fall to 45 to 69 percent. The remaining three are hybrids. The gap between the two columns is the paper.
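The arithmetic behind those two columns is simple enough to check. The sketch below is mine, not the review's: it takes only the counts and score bands quoted above and works out each category's share of the 39 benchmarks, plus the smallest and largest gap the two bands allow. The names `share` and `band_gap` are illustrative and do not come from the paper.

```python
# Counts and score bands as quoted in this article; nothing else is assumed.
TOTAL = 39
categories = {"knowledge-based": 21, "practice-based": 15, "hybrid": 3}

knowledge_band = (84, 90)  # percent, leading models on knowledge benchmarks
practice_band = (45, 69)   # percent, success rates on practice benchmarks


def share(count, total=TOTAL):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


def band_gap(high_band, low_band):
    """Smallest and largest possible gap between two score bands, in points."""
    smallest = high_band[0] - low_band[1]
    largest = high_band[1] - low_band[0]
    return smallest, largest


# The three categories should account for every benchmark in the review.
assert sum(categories.values()) == TOTAL

shares = {name: share(n) for name, n in categories.items()}
gap = band_gap(knowledge_band, practice_band)

print(shares)
print(gap)
```

On these figures the knowledge-based benchmarks make up 54 percent of the set, and the distance between the two bands runs from 15 to 45 percentage points, depending on which ends you compare.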
The figure to hold onto
On the tasks that test safety (contraindication recognition, harm avoidance, risk communication, the parts of medicine where being wrong is least forgivable), the models score 40 to 50 percent. For scale, the medication-error rate in well-run hospital systems sits somewhere around 0.1 to 1 percent; that comparison is mine, not the paper's, but the order of magnitude carries the point. A second layer of human review does not paper over a gap of that size. The gap tells you which decisions cannot be delegated at all, whatever the model scored on a knowledge test.
There is a quieter trap the authors call the empathy paradox. In blinded evaluations, language models consistently outscore physicians on empathy and patient-satisfaction ratings — and empathy genuinely tracks better outcomes. But a warm, fluent reply to a description of chest discomfort that misses the urgency has done well on the wrong scale. High communication marks can mask diagnostic weakness, which is an argument for measuring the right endpoint, not an argument against kind machines.
“Examination scores are insufficient and misleading proxies for clinical readiness”: that is the authors' phrasing, and the sentence to keep.
What it does not establish
Being a synthesis of published benchmarks, the review inherits their limits rather than transcending them. The benchmarks judge models in isolation, not embedded in a clinical workflow with its interruptions and incentives; static test sets go stale as medicine moves; and ten of the thirty-nine — 26 percent — reported their methods too thinly to appraise fully. For European readers one number is worth naming: 33 percent of the benchmarks originate in North America and 31 percent in Asia, against 13 percent in Europe, with just five of European origin. Performance measured chiefly in US and East Asian systems does not transfer cleanly to European documentation standards and care pathways, which makes European validation a practical quality requirement rather than a formality.
Why it matters here
The hopeful part is that the measurement is catching up: 59 percent of the benchmarks appeared after 2023, and the better instruments now exist. MedAgentBench runs inside a Fast Healthcare Interoperability Resources (FHIR)-compliant virtual record assembled from more than 700,000 data elements from Stanford Hospital, scoring 300 clinically derived tasks; HealthBench grades 5,000 multi-turn conversations against rubrics built by 262 physicians across 60 countries. Those are the right things to put a system in front of. On this evidence the authors' conclusion is hard to dispute: autonomous clinical deployment is not currently justifiable, and human-in-the-loop oversight is the evidence-based position rather than the timid one. Passing the exam and treating the patient were never the same act; we long ago stopped confusing them in people. What is new is that, with AI, the exam keeps being handed back as the proof.
Source: Gong EJ, Bang CS, Lee JJ, Baik GH. Knowledge-Practice Performance Gap in Clinical Large Language Models: Systematic Review of 39 Benchmarks. J Med Internet Res 2025;27:e84120. A systematic review reported in line with PRISMA, with narrative synthesis: it summarises and categorises existing benchmarks rather than generating primary clinical data, and the heterogeneity of those benchmarks prevented a pooled meta-analysis.


