GPT-5 Reads the PET Scan Confidently. It Just Misses the Cancer That Has Spread.

A single-centre pilot put GPT-5 against radiologists on PET staging of oesophageal cancer. Its overall accuracy looks respectable. Its sensitivity for the metastases that change the treatment plan does not — and that is the number that decides whether a tool is usable.

Dr. Sven Jungmann

CEO

Thirty-five of the 120 patients in this study had cancer in the lymph nodes of their abdomen. GPT-5, reading their PET scans, identified five of them. The radiologist of record identified twenty. Yet on the same task GPT-5 lands a 73 percent overall accuracy — a figure most people would read as 'almost good enough'. Both numbers are true. Only one of them tells you whether the tool is safe to use.

The distance between those two numbers is the entire point of this paper, and mistaking one for the other is among the most reliable ways a medical-AI result gets misread in a board meeting.

The study, and where it sits

A single-centre retrospective pilot — the lowest rung of the evidence ladder, which the authors state plainly. Maruyama and colleagues at Tohoku University Hospital took 120 patients with biopsy-proven oesophageal squamous-cell carcinoma who had an [18F]FDG-PET/CT between January 2019 and December 2021. Each scan was reduced to a single standardised frontal maximum-intensity-projection (MIP) image. Six large language models — GPT-5, GPT-4.5, GPT-4.1, OpenAI-o3 and -o1, and GPT-4 Turbo — were asked to stage the disease from that one image plus the tumour location. The comparison group was four blinded clinicians: a nuclear-medicine specialist with 14 years' experience, a gastrointestinal surgeon with nine, and two radiology residents. The reference standard was the certified radiology report under the 8th-edition UICC TNM staging system. The authors followed the CLAIM checklist for AI in medical imaging — more methodological discipline than this corner of the literature usually brings.

What it gets right

Read on overall accuracy alone, GPT-5 was the best of the models and not embarrassingly behind the radiologist: 63 percent versus 78 for thoracic nodes, 73 versus 80 for abdominal nodes, 48 versus 58 for clinical N-stage, 77 versus 78 for M-stage. The gaps were statistically significant for the nodal and N-stage tasks (P from .002 to <.001) but not for M-stage (P = .052). Newer models beat older ones with reasonable consistency, which is a genuine signal that multimodal training is improving — worth stating without irony.

GPT-5 was also highly specific: it correctly cleared patients who had no metastases 94 to 98 percent of the time. A model that almost never raises a false alarm has a real, narrow use as a second reader whose positive flags are worth a second look, provided everyone knows that a negative call from it rules nothing out.

What it cannot do

Start with the metric that refuses to be flattered. The Matthews correlation coefficient rewards a model only for genuine discrimination, not for coasting on a majority of negative cases. For GPT-5 it was 0.32 on thoracic nodes against the radiologist's 0.57, 0.20 against 0.48 on abdominal nodes, and 0.04 against 0.28 on M-stage. Across all tasks the physicians scored between 0.28 and 0.75; the language models scored between –0.07 and 0.32. That is the honest summary the headline accuracy hides.
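For reference, this is the standard definition of the Matthews correlation coefficient for a binary task. The symbols TP, TN, FP and FN (true and false positives and negatives) are notation added here, not taken from the paper.

```latex
\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
```

The score runs from -1 to 1. A model that calls every case negative has TP = FP = 0, so the numerator is zero (the denominator is too, and the score is conventionally set to 0) however high its accuracy.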

The reason is sensitivity, the share of genuinely affected patients the model catches. It was 31 percent for thoracic nodes against the radiologist's 84, 14 percent for abdominal nodes against 57, and 4 percent for distant (M) disease against 33. With 35 of 120 patients carrying abdominal nodal metastases and only 27 carrying M1 disease, the dataset is heavily weighted toward negatives, and a model that mostly says 'no spread' will look accurate while detecting almost nothing. The 73 percent is class imbalance doing the work, not discrimination, which is exactly why a single accuracy figure should never close the argument.
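The arithmetic behind the class-imbalance point can be sketched in a few lines of Python. The confusion-matrix counts are an assumption: 5 detected and 30 missed patients follow from the figures quoted in this article, but the split of the 85 negatives into 3 false alarms and 82 correct clearances is a hypothetical reconstruction chosen to land close to the reported accuracy, specificity and MCC. It is not data taken from the paper, and the function name `binary_metrics` is illustrative.

```python
from math import sqrt


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + fn + fp + tn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # MCC is undefined when a row or column of the matrix is empty;
        # returning 0.0 in that case is a common convention.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts for abdominal nodes: 35 affected, 85 unaffected patients.
# Only tp=5 and fn=30 follow from the article; fp=3 and tn=82 are assumed.
model = binary_metrics(tp=5, fn=30, fp=3, tn=82)

# A "model" that answers 'no spread' for every patient.
always_negative = binary_metrics(tp=0, fn=35, fp=0, tn=85)

for name, result in (("model", model), ("always negative", always_negative)):
    print(name, {key: f"{value:.3f}" for key, value in result.items()})
```

Under these assumed counts, the always-negative baseline reaches about 71 percent accuracy while finding no one, and the model's accuracy sits only slightly above it. Sensitivity and MCC separate the two; accuracy barely does.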

This is the number that matters because of what staging decides. In oesophageal cancer it separates neoadjuvant chemoradiotherapy from primary resection. Miss nodal or distant spread and a patient may be sent into an oesophagectomy — among the most invasive operations in oncology — that cannot cure them, or be denied the chemoradiotherapy that would have changed their prognosis. A tool that misses six of every seven patients with abdominal metastases is not a marginal aid for this decision; it is the wrong instrument, whatever the headline accuracy says.

Two limitations keep the appraisal fair to the design. The model saw one MIP image — a flat projection, not the volumetric slices a radiologist actually reads — with no SUVmax or any quantitative metabolic value, so this measures a deliberately stripped-down version of the task rather than PET reading as practised. And in a small post-hoc subanalysis the output proved unstable: in one case GPT-5 had correctly flagged a thoracic node, then called the same case negative once the prompt was nudged toward asking for its reasoning; in another it reached the right N-stage while hallucinating an abdominal node in its rationale. Two cases are not a reliability statistic, but they point to a stochastic property of these systems that better prompting does not fully remove.

The reading that matters

The pressure behind the study is familiar to European systems too — rising imaging volumes against a shrinking pool of radiologists — and it is exactly that pressure that makes a 73-percent figure tempting to read as 'nearly there'. This paper is a clean demonstration of why that reading fails, and the lesson travels well past PET and past GPT-5: for any screening-like task on imbalanced data, ask for sensitivity and a balanced metric before you let an accuracy number into the room. The authors put the verdict where it belongs — current general-purpose models do not reach physician-level accuracy and their sensitivity for nodal and distant metastases is insufficient for clinical use. Read that as a list of what would have to change — volumetric input, quantitative parameters, multi-centre validation, some way to verify where the model is looking — not as a prediction that it never will.

Source: Maruyama H, Toyama Y, Araki Y, et al. Evaluation of GPT-5 for Esophageal Cancer Staging Using FDG-PET Maximum-Intensity Projection Images: Comparative Pilot Study. JMIR Cancer 2026;12:e86630. A single-centre retrospective pilot on 120 cases from one institution — informative about a constrained task, not generalisable, and not designed to measure any patient outcome.
