Journal Club · 5 min read

When the Score Jumps Tenfold: Was It the Training, or the Tool?

A study of 326 physicians found pass rates on a clinical-reasoning test rose from 6.4 to 58.6 percent after a 90-minute AI course. The number is real. The design cannot tell us how much of it was the teaching and how much was simply being handed GPT-4.

Dr. Sven Jungmann

CEO

Twenty-one of 326 doctors passed a clinical-reasoning test on the first attempt. After a ninety-minute online course, 191 of the same 326 passed. A pass rate of 6.4 percent became 58.6 percent: a near-tenfold jump from a single afternoon's intervention, and exactly the kind of figure that ends up on a slide with no asterisk. The study it comes from, published on 5 February 2026 in JMIR Medical Education, is worth reading because the figure is genuine and the methods quietly warn you against reading it the way the slide invites.
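The headline arithmetic is easy to re-derive from the counts quoted above. A minimal Python sketch, with variable names of our own choosing, separates the two ways of describing the same change: the absolute gain in percentage points and the relative (fold) increase.

```python
# Counts as quoted in the article: 326 physicians, 21 passing before, 191 after.
N_PHYSICIANS = 326
PASSED_BEFORE = 21
PASSED_AFTER = 191

rate_before = PASSED_BEFORE / N_PHYSICIANS
rate_after = PASSED_AFTER / N_PHYSICIANS

# Absolute change, expressed in percentage points.
gain_points = (rate_after - rate_before) * 100

# Relative change: how many times larger the second pass rate is.
fold_increase = rate_after / rate_before

print(f"before: {rate_before:.1%}, after: {rate_after:.1%}")
print(f"gain: {gain_points:.1f} percentage points, ratio: {fold_increase:.1f}x")
```

The ratio comes out at about 9.1, which is what "near-tenfold" is rounding, and the absolute gain at about 52 percentage points. Neither figure says anything about what caused the change.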

The strength worth naming first

Most education studies of this sort are small and parochial: a handful of enthusiasts at one teaching hospital. This one is not. It enrolled 326 general practitioners and internists from 39 countries — with the largest cohorts from Saudi Arabia, Syria, Egypt, Algeria and Jordan — and 308 of them, 94.5 percent, had no prior structured training in using AI. The authors also did something methodologically careful: rather than reuse one test, they built two validated, comparable question sets and swapped them between groups before and after the course, so each physician served as their own control against the noise of individual ability. That within-subject crossover is a real design strength. It is just not the strength the headline needs.

What was measured, and how

The intervention was a 1.5-hour asynchronous online workshop on using AI effectively in clinical reasoning. Around it sat two tests covering three domains — diagnosis, treatment planning and patient counselling. The detail that governs the whole appraisal is in the second test: it permitted AI assistance, specifically GPT-4.0, with participants instructed to judge the model's output rather than copy it. The baseline test allowed no such help. So the contrast is not the unaided doctor before and after a course. It is the unaided doctor against a doctor who has been both taught and handed a capable language model for the exam.

What the evidence supports

Read within that frame, the gains are large and consistent. Mean scores rose from 56.9 to 77.6 percent, about 21 percentage points (P<.001). All three domains improved significantly (P<.001 throughout), with the strongest effects in diagnosis (r=0.738) and treatment planning (r=0.686) and a more modest one in patient counselling (r=0.420). General practitioners gained more than internists — 23.7 against 13.7 percentage points. Age tracked only weakly with the size of the gain (ρ=–0.143). Most usefully, prior familiarity with AI made no significant difference: the benefit did not require a head start.
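To put those effect sizes on a common scale, the sketch below applies Cohen's conventional thresholds for a correlation-type effect size (0.1 small, 0.3 medium, 0.5 large). The thresholds are a rule of thumb applied here for orientation, not something the article attributes to the authors. The function name `cohen_label` is hypothetical, and nothing is assumed about how the study computed r.

```python
def cohen_label(r: float) -> str:
    """Label an r-type effect size using Cohen's conventional cut-offs."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"

# Effect sizes as quoted in the article.
domain_effects = {
    "diagnosis": 0.738,
    "treatment planning": 0.686,
    "patient counselling": 0.420,
}
age_correlation = -0.143  # Spearman's rho between age and size of gain

labels = {domain: cohen_label(r) for domain, r in domain_effects.items()}
age_label = cohen_label(age_correlation)

# Gains in percentage points, from the quoted means and subgroup figures.
mean_gain = round(77.6 - 56.9, 1)
gp_minus_internist = round(23.7 - 13.7, 1)

for domain, label in labels.items():
    print(f"{domain}: r={domain_effects[domain]} ({label})")
print(f"age vs gain: rho={age_correlation} ({age_label})")
print(f"mean gain: {mean_gain} points; GP minus internist: {gp_minus_internist} points")
```

On these conventions, diagnosis and treatment planning count as large effects, patient counselling as medium, and the age correlation as small.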

What it does not support

The clever crossover controls for who the physicians are. It does nothing for what changed between the two sittings. There is no non-intervention arm — no group that took the second test with the tool but without the course, or with the course but without the tool. Three things therefore moved together and cannot be told apart: learning from the workshop, the raw assistance of GPT-4, and ordinary familiarity with a second, similar exam. The headline credits the teaching. The data can credit, at most, 'trained physician plus model' — and a clinician's instinct is that a capable model answering structured multiple-choice questions is carrying a fair share of that load on its own.
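The confound can be written down in one line. The notation is ours, not the authors', and it assumes for illustration that the contributions simply add up, with an interaction term for any benefit that appears only when training and tool are combined:

```latex
\begin{aligned}
\Delta_{\text{observed}}
  &= \bar{Y}_{\text{post}} - \bar{Y}_{\text{pre}}
   = 77.6 - 56.9 = 20.7 \text{ percentage points} \\
  &= \tau_{\text{course}} + \tau_{\text{tool}} + \tau_{\text{practice}}
   + \tau_{\text{course} \times \text{tool}}
\end{aligned}
```

One measured difference and several unknown terms: the equation has no unique solution. Arms that varied the course and the tool independently would supply the additional equations needed to separate them.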

The authors are admirably plain about this. Their own limitations section states that the design "does not allow isolation of the effect of the tailored AI training course from AI use without training." They add a second caveat that cuts the other way: the people who volunteered for a time-intensive AI study were probably warmer to AI than physicians at large, so the effect could be either inflated by selection or, as they argue, an underestimate of what a less-receptive colleague might gain. Either way, the result may not transfer to the sceptic down the corridor. And the outcome was a test score, not care delivered. A higher mark on assessment scenarios is not a shorter diagnostic odyssey, a correct prescription or a patient spared harm; nobody was followed forward, and the durability of the gain is unknown.

Why it matters here

None of this makes the study unimportant — it makes its lesson narrower and, arguably, more honest than the number suggests. What it genuinely shows is that physicians given a short, structured orientation use a general-purpose model on clinical-reasoning tasks markedly better than they perform unaided, and that this holds across countries and across levels of prior exposure. For continuing medical education that is a real signal: these tools are arriving in clinics whether or not anyone teaches their use, and a brief, low-cost, fully online orientation measurably changes how they are used — on paper. The question the study cannot close is the one that decides whether any of this belongs in a curriculum: how much of the measured competence travels from the test into the consultation, how long it lasts, and how much survives once you account for simply having the model in the room.

Source: Qunaibi EA, Al-Qaaneh AM, Ismail BF, et al. Effectiveness of Informed AI Use on Clinical Competence of General Practitioners and Internists: Pre-Post Intervention Study. JMIR Medical Education 2026;12:e75534. A single-cohort, within-subject crossover with no non-intervention control arm, in self-selected volunteers, measuring test performance rather than patient outcomes; the post-test permitted AI assistance and the baseline did not.


This analysis comes from the people behind Visite.
