GPT-5 Reads the PET Scan Confidently. It Just Misses the Cancer That Has Spread.
A single-centre pilot pitted GPT-5 against clinicians on PET staging of oesophageal cancer. Its overall accuracy looks respectable. Its sensitivity for the metastases that change the treatment plan does not, and that is the number that decides whether a tool is usable.

Dr. Sven Jungmann
CEO

Thirty-five of the 120 patients in this study had cancer in the lymph nodes of their abdomen. GPT-5, reading their PET scans, identified five of them. The radiologist of record identified twenty. Yet on the same task GPT-5 lands a 73 percent overall accuracy — a figure most people would read as 'almost good enough'. Both numbers are true. Only one of them tells you whether the tool is safe to use.
The distance between those two numbers is the entire point of this paper, and it is one of the most reliable ways a medical-AI result gets misread in a board meeting.
The study, and where it sits
This is a single-centre retrospective pilot, a low rung of the evidence ladder, which the authors state plainly. Maruyama and colleagues at Tohoku University Hospital took 120 patients with biopsy-proven oesophageal squamous-cell carcinoma who had an [18F]FDG-PET/CT between January 2019 and December 2021. Each scan was reduced to a single standardised frontal maximum-intensity-projection (MIP) image. Six large language models (GPT-5, GPT-4.5, GPT-4.1, OpenAI-o3 and -o1, and GPT-4 Turbo) were asked to stage the disease from that one image plus the tumour location. The comparison group was four blinded clinicians: a nuclear-medicine specialist with 14 years' experience, a gastrointestinal surgeon with nine, and two radiology residents. The reference standard was the certified radiology report under the 8th-edition UICC TNM staging system. The authors followed the CLAIM checklist for AI in medical imaging, which is more methodological discipline than this corner of the literature usually brings.
What it gets right
Read on overall accuracy alone, GPT-5 was the best of the models and not embarrassingly behind the radiologist: 63 percent versus 78 for thoracic nodes, 73 versus 80 for abdominal nodes, 48 versus 58 for clinical N-stage, 77 versus 78 for M-stage. The gaps were statistically significant for the nodal and N-stage tasks (P from .002 to <.001) but not for M-stage (P = .052). Newer models beat older ones with reasonable consistency, which is a genuine signal that multimodal training is improving — worth stating without irony.
GPT-5 was also highly specific: it correctly cleared patients who had no metastases 94 to 98 percent of the time. A model that almost never raises a false alarm has a real, narrow use as a second reader whose rare positive calls deserve a closer look, provided everyone knows that its negative calls rule nothing out, because it misses most of the patients who do have spread.
What it cannot do
Start with the metric that refuses to be flattered. The Matthews correlation coefficient rewards a model only for genuine discrimination, not for coasting on a majority of negative cases. For GPT-5 it was 0.32 on thoracic nodes against the radiologist's 0.57, 0.20 against 0.48 on abdominal nodes, and 0.04 against 0.28 on M-stage. Across all tasks the physicians scored between 0.28 and 0.75; the language models scored between –0.07 and 0.32. That is the honest summary the headline accuracy hides.
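As a minimal sketch of why this metric is hard to flatter, the standard Matthews formula can be computed directly from a confusion matrix. The function name `mcc` and both readers below are illustrative assumptions, not data or code from the paper: one hypothetical reader calls every case negative on a 35-positive, 85-negative split (the abdominal-node prevalence quoted in this article), the other is a hypothetical perfect reader.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (for example, when a reader
    never predicts one of the classes), which is the usual convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical readers on a 35-positive / 85-negative split (not study data).
always_negative = mcc(tp=0, fp=0, fn=35, tn=85)  # never calls a case positive
perfect_reader = mcc(tp=35, fp=0, fn=0, tn=85)   # every case classified correctly

print(f"always-negative reader: MCC = {always_negative:.2f}")
print(f"perfect reader:         MCC = {perfect_reader:.2f}")
```

The always-negative reader is right on every negative case and still scores 0, because it shows no ability to tell affected patients from unaffected ones.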
The reason is sensitivity — the share of genuinely affected patients the model catches. It was 31 percent for thoracic nodes against the radiologist's 84, 14 percent for abdominal nodes against 57, and 4 percent for distant (M) disease against 33. With 35 of 120 patients carrying abdominal nodal metastases and only 27 carrying M1 disease, the dataset is heavily weighted toward negatives, and a model that mostly says 'no spread' will look accurate while detecting almost nothing. The 73 percent is class imbalance doing the work, not discrimination.
“The high accuracy is class imbalance doing the work, not discrimination — which is exactly why a single accuracy figure should never close the argument.”
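The class-imbalance claim can be checked from the counts quoted in this article alone. The sketch below assumes a hypothetical reader that calls every scan negative and computes the accuracy it would earn on each task; the helper names are illustrative, and nothing here comes from the paper's own code.

```python
def sensitivity(true_positives: int, actual_positives: int) -> float:
    """Share of genuinely affected patients that a reader catches."""
    return true_positives / actual_positives


def always_negative_accuracy(n_total: int, n_positive: int) -> float:
    """Accuracy of a hypothetical reader that labels every case negative."""
    return (n_total - n_positive) / n_total


N_PATIENTS = 120          # cohort size quoted in the article
ABDOMINAL_POSITIVE = 35   # patients with abdominal nodal metastases
M1_POSITIVE = 27          # patients with distant (M1) disease

# Abdominal nodes: GPT-5 caught 5 of 35, the radiologist 20 of 35.
gpt5_sens = sensitivity(5, ABDOMINAL_POSITIVE)
radiologist_sens = sensitivity(20, ABDOMINAL_POSITIVE)
print(f"GPT-5 sensitivity:       {gpt5_sens:.1%}")
print(f"radiologist sensitivity: {radiologist_sens:.1%}")

# Accuracy earned by saying 'no spread' every single time.
abdominal_baseline = always_negative_accuracy(N_PATIENTS, ABDOMINAL_POSITIVE)
m_stage_baseline = always_negative_accuracy(N_PATIENTS, M1_POSITIVE)
print(f"always-negative, abdominal nodes: {abdominal_baseline:.1%}")
print(f"always-negative, M-stage:         {m_stage_baseline:.1%}")
```

On these counts, a reader that never reports spread would be right on 85 of 120 abdominal-node cases (about 71 percent) and 93 of 120 M-stage cases (77.5 percent), which sits very close to GPT-5's reported 73 and 77 percent.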
This is the number that matters because of what staging decides. In oesophageal cancer it separates neoadjuvant chemoradiotherapy from primary resection. Miss nodal or distant spread and a patient may be sent into an oesophagectomy — among the most invasive operations in oncology — that cannot cure them, or be denied the chemoradiotherapy that would have changed their prognosis. A tool that misses six of every seven patients with abdominal metastases is not a marginal aid for this decision; it is the wrong instrument, whatever the headline accuracy says.
Two limitations keep the appraisal fair to the design. The model saw one MIP image — a flat projection, not the volumetric slices a radiologist actually reads — with no SUVmax or any quantitative metabolic value, so this measures a deliberately stripped-down version of the task rather than PET reading as practised. And in a small post-hoc subanalysis the output proved unstable: in one case GPT-5 had correctly flagged a thoracic node, then called the same case negative once the prompt was nudged toward asking for its reasoning; in another it reached the right N-stage while hallucinating an abdominal node in its rationale. Two cases are not a reliability statistic, but they point to a stochastic property of these systems that better prompting does not fully remove.
The reading that matters
The pressure behind the study is familiar to European systems too — rising imaging volumes against a shrinking pool of radiologists — and it is exactly that pressure that makes a 73-percent figure tempting to read as 'nearly there'. This paper is a clean demonstration of why that reading fails, and the lesson travels well past PET and past GPT-5: for any screening-like task on imbalanced data, ask for sensitivity and a balanced metric before you let an accuracy number into the room. The authors put the verdict where it belongs — current general-purpose models do not reach physician-level accuracy and their sensitivity for nodal and distant metastases is insufficient for clinical use. Read that as a list of what would have to change — volumetric input, quantitative parameters, multi-centre validation, some way to verify where the model is looking — not as a prediction that it never will.
Source: Maruyama H, Toyama Y, Araki Y, et al. Evaluation of GPT-5 for Esophageal Cancer Staging Using FDG-PET Maximum-Intensity Projection Images: Comparative Pilot Study. JMIR Cancer 2026;12:e86630. A single-centre retrospective pilot on 120 cases from one institution — informative about a constrained task, not generalisable, and not designed to measure any patient outcome.


