When the Score Jumps Nearly Tenfold: Was It the Training, or the Tool?
A study of 326 physicians found pass rates on a clinical-reasoning test rose from 6.4 to 58.6 percent after a 90-minute AI course. The number is real. The design cannot tell us how much of it was the teaching and how much was simply being handed GPT-4.

Dr. Sven Jungmann
CEO

Twenty-one of 326 doctors passed a clinical-reasoning test on the first attempt. After a ninety-minute online course, 191 of the same 326 passed. A pass rate of 6.4 percent became 58.6 percent: a near-tenfold jump from a single afternoon's intervention, and exactly the kind of figure that ends up on a slide with no asterisk. The study it comes from, published on 5 February 2026 in JMIR Medical Education, is worth reading because the figure is genuine and the methods quietly warn you against reading it the way the slide invites.
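The headline arithmetic is easy to check from the counts reported above. What follows is a minimal Python sketch: only the counts (21, 191 and 326) come from the article, and the variable names are illustrative.

```python
# Counts as reported in the article; variable names are illustrative.
n_physicians = 326
passed_before = 21
passed_after = 191

rate_before = 100 * passed_before / n_physicians
rate_after = 100 * passed_after / n_physicians

# Relative change (ratio of pass rates) versus absolute change (percentage points).
fold_change = passed_after / passed_before
point_change = rate_after - rate_before

print(f"before: {rate_before:.1f}%  after: {rate_after:.1f}%")
print(f"ratio: {fold_change:.1f}x  absolute gain: {point_change:.1f} points")
```

The ratio of about 9.1 is what makes the figure striking. The absolute gain of about 52 percentage points is the same result stated less dramatically.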
The strength worth naming first
Most education studies of this sort are small and parochial: a handful of enthusiasts at one teaching hospital. This one is not. It enrolled 326 general practitioners and internists from 39 countries — with the largest cohorts from Saudi Arabia, Syria, Egypt, Algeria and Jordan — and 308 of them, 94.5 percent, had no prior structured training in using AI. The authors also did something methodologically careful: rather than reuse one test, they built two validated, comparable question sets and swapped them between groups before and after the course, so each physician served as their own control against the noise of individual ability. That within-subject crossover is a real design strength. It is just not the strength the headline needs.
What was measured, and how
The intervention was a 1.5-hour asynchronous online workshop on using AI effectively in clinical reasoning. Around it sat two tests covering three domains — diagnosis, treatment planning and patient counselling. The detail that governs the whole appraisal is in the second test: it permitted AI assistance, specifically GPT-4.0, with participants instructed to judge the model's output rather than copy it. The baseline test allowed no such help. So the contrast is not the unaided doctor before and after a course. It is the unaided doctor against a doctor who has been both taught and handed a capable language model for the exam.
What the evidence supports
Read within that frame, the gains are large and consistent. Mean scores rose from 56.9 to 77.6 percent, about 21 percentage points (P<.001). All three domains improved significantly (P<.001 throughout), with the strongest effects in diagnosis (r=0.738) and treatment planning (r=0.686) and a more modest one in patient counselling (r=0.420). General practitioners gained more than internists — 23.7 against 13.7 percentage points. Age tracked only weakly with the size of the gain (ρ=–0.143). Most usefully, prior familiarity with AI made no significant difference: the benefit did not require a head start.
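One way to read the effect sizes above is against Cohen's conventional benchmarks for r (0.1 small, 0.3 medium, 0.5 large). The sketch below applies those benchmarks to the reported values. The choice of benchmarks and the helper name `label_effect` are assumptions made for illustration; neither the study nor this article specifies them.

```python
# Reported values from the article. The thresholds are Cohen's conventional
# cut-offs for r, applied here as an illustrative assumption.
reported_r = {
    "diagnosis": 0.738,
    "treatment planning": 0.686,
    "patient counselling": 0.420,
}
age_rho = -0.143  # correlation between age and size of gain


def label_effect(r: float) -> str:
    """Label |r| as negligible, small, medium or large (hypothetical helper)."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"


labels = {domain: label_effect(r) for domain, r in reported_r.items()}
age_label = label_effect(age_rho)

# Gains in percentage points, computed from the reported means.
mean_gain = round(77.6 - 56.9, 1)
gp_minus_internist = round(23.7 - 13.7, 1)

for domain, label in labels.items():
    print(f"{domain}: r={reported_r[domain]:.3f} ({label})")
print(f"age: rho={age_rho:.3f} ({age_label})")
print(f"mean gain: {mean_gain} points; GP advantage: {gp_minus_internist} points")
```

On these conventions the diagnosis and treatment-planning effects are large, the counselling effect is medium, and the association with age is small. The labels are a reading aid, not a finding of the study.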
What it does not support
The clever crossover controls for who the physicians are. It does nothing for what changed between the two sittings. There is no non-intervention arm — no group that took the second test with the tool but without the course, or with the course but without the tool. Three things therefore moved together and cannot be told apart: learning from the workshop, the raw assistance of GPT-4, and ordinary familiarity with a second, similar exam. The headline credits the teaching. The data can credit, at most, 'trained physician plus model' — and a clinician's instinct is that a capable model answering structured multiple-choice questions is carrying a fair share of that load on its own.
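The identification problem can be laid out as a two-by-two grid of course and tool. The Python sketch below is an illustration only: the `observed` cells encode the design as this article describes it, the helper `varies_alone` is hypothetical, and no study data is involved.

```python
from itertools import product

# Each cell is (took_course, had_model). The study, as described here,
# observed only two of the four cells; this dict encodes that reading.
observed = {
    (False, False): "baseline test: no course, no model",
    (True, True): "second test: course and model",
}

all_cells = list(product([False, True], repeat=2))
missing = [cell for cell in all_cells if cell not in observed]


def varies_alone(factor_index: int) -> bool:
    """True if some pair of observed cells differs only in the given factor."""
    cells = list(observed)
    other = 1 - factor_index
    return any(
        a[factor_index] != b[factor_index] and a[other] == b[other]
        for a in cells
        for b in cells
    )


course_isolated = varies_alone(0)
model_isolated = varies_alone(1)

print("unobserved cells (course, model):", missing)
print("course effect separable:", course_isolated)
print("model effect separable:", model_isolated)
```

Neither factor ever changes while the other is held fixed, so neither effect can be separated from the other. The third factor named above, familiarity with a second similar exam, changes along the same diagonal and is not modelled in this sketch.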
The authors are admirably plain about this. Their own limitations section states that the design "does not allow isolation of the effect of the tailored AI training course from AI use without training." They add a second caveat about who took part: the people who volunteered for a time-intensive AI study were probably warmer to AI than physicians at large. That selection could inflate the effect or, as the authors argue, understate what a less receptive colleague might gain. Either way, the result may not transfer to the sceptic down the corridor. And the outcome was a test score, not care delivered. A higher mark on assessment scenarios is not a shorter diagnostic odyssey, a correct prescription or a patient spared harm; nobody was followed forward, and the durability of the gain is unknown.
Why it matters here
None of this makes the study unimportant. It makes its lesson narrower and, arguably, more honest than the number suggests. What it genuinely shows is that physicians given a short, structured orientation and access to a general-purpose model score markedly higher on clinical-reasoning tasks than they did unaided, and that this holds across countries and across levels of prior exposure. For continuing medical education that is a real signal: these tools are arriving in clinics whether or not anyone teaches their use, and a brief, low-cost, fully online orientation paired with the tool measurably improves performance, at least on paper. The question the study cannot close is the one that decides whether any of this belongs in a curriculum: how much of the measured competence travels from the test into the consultation, how long it lasts, and how much survives once you account for simply having the model in the room.
Source: Qunaibi EA, Al-Qaaneh AM, Ismail BF, et al. Effectiveness of Informed AI Use on Clinical Competence of General Practitioners and Internists: Pre-Post Intervention Study. JMIR Medical Education 2026;12:e75534. A single-cohort, within-subject crossover with no non-intervention control arm, in self-selected volunteers, measuring test performance rather than patient outcomes; the post-test permitted AI assistance and the baseline did not.


