The Most Predictive Variable Was Missing From Three of Four Records
In a real colorectal-cancer dataset, the single most prognostic variable — tumour stage — was absent from 75 percent of records, and half the rest were miscoded. A quiet, careful paper on why a model can only learn what the data actually contain.

Dr. Sven Jungmann
CEO

Some failures in clinical machine learning are not in the algorithm. They are in a database field that nobody ever filled in. A team in Incheon, working with a real registry of 6,491 colorectal-cancer patients treated at Gachon University Gil Medical Center between 2010 and 2022, went looking for the failure and found it sitting in plain sight: the variable that matters most for predicting how a colorectal-cancer patient will fare — the tumour stage — was simply missing from three of every four records.
Their report, published on 13 November 2025 in JMIR Medical Informatics and funded by Korea's National Research Foundation with no declared competing interests, is not about a clever model. It is about the column the model never gets to read. It earns its hour precisely because it declines to be exciting.
Two numbers that frame everything
Stage in colorectal cancer is captured two ways. The detailed TNM classification (tumour, node, metastasis) is missing from 75.3 percent of records. The coarser SEER summary stage — localised, regional, distant — is missing from a far gentler 24.3 percent. The gap between those two figures is the whole story: the more granular and prognostically useful the variable, the less reliably it had been recorded. And of the TNM entries that did exist, half were wrong — 43 coding errors among 86 non-missing cases. The SEER entries fared somewhat better, with 47 errors in 151 cases, a 31.1 percent error rate. Neither field, in other words, could be trusted as it stood.
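The two error rates follow directly from the counts quoted above. A minimal Python check, using only figures that appear in this article; the helper name `percent` is ours, not the paper's:

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Counts as quoted in the article: coding errors among non-missing entries.
tnm_error_rate = percent(43, 86)
seer_error_rate = percent(47, 151)

print(f"TNM error rate:  {tnm_error_rate} %")
print(f"SEER error rate: {seer_error_rate} %")
```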
What they actually did
The authors built a rules-based quality-management process and ran it across the dataset in four stages — planning, identification, operation, evaluation. The work that counts happened in operation. Much of the missing staging was not truly absent from the hospital: it existed in free text, in pathology and imaging reports, and had simply never been transcribed into the structured field an algorithm reads. So the team wrote an automatic staging library — keyword rules that parse those reports and assign T, N and M categories. Against manual coding on 164 randomly drawn cases, the automated assignments agreed 93.3 percent of the time for TNM and 93.9 percent for the SEER summary stage. That is a respectable concordance, but it should be read with one eye on its reference: the human coding it is compared against is the same coding this paper found to be 50 percent wrong for TNM. The automation is being graded against an imperfect marker, not a gold standard. And this is a single-centre, retrospective case study, not a trial.
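To make the idea of a keyword rule concrete, here is a minimal sketch in Python. It is not the authors' staging library, whose rules are not reproduced in this article. The pattern `TNM_RULE`, the function names and the sample report are all hypothetical, and a real rule set would need to handle a hospital's own phrasing and language.

```python
import re
from typing import Dict, Optional, Sequence

# Hypothetical rule: an optional c/p/y prefix, then T, N and M categories
# written close together, e.g. "pT3 N1b M0" or "ypT2N0M0".
# Illustrative only; the study's actual keyword rules are not shown here.
TNM_RULE = re.compile(
    r"\b[cpy]{0,2}T(?P<T>is|[0-4][ab]?|x)[\s,]*"
    r"[cpy]{0,2}N(?P<N>[0-3][abc]?|x)[\s,]*"
    r"[cpy]{0,2}M(?P<M>[01][abc]?|x)\b",
    re.IGNORECASE,
)


def extract_tnm(report: str) -> Optional[Dict[str, str]]:
    """Return the first T, N, M categories found in free text, or None."""
    match = TNM_RULE.search(report)
    if match is None:
        return None  # leave the field missing rather than guess
    return {key: value.lower() for key, value in match.groupdict().items()}


def percent_agreement(automated: Sequence, manual: Sequence) -> float:
    """Share of cases (in percent) where two codings give the same value."""
    if len(automated) != len(manual) or not manual:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(a == m for a, m in zip(automated, manual))
    return 100 * matches / len(manual)


# Invented sample report, for illustration only.
report = "Pathology: adenocarcinoma, pT3 N1b M0, margins clear."
print(extract_tnm(report))
print(extract_tnm("No staging information in this note."))
```

Note what `percent_agreement` does and does not measure: it reports how often the automated value equals the reference value, so it inherits whatever errors the reference contains.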
What the evidence supports
The cleanup worked on its own terms. Missing TNM data fell from 75.3 to 35.7 percent, and missing SEER stage from 24.3 to 18.5 percent. The prognostic model's headline metric — its AUROC (area under the receiver-operating-characteristic curve, a measure of how well it separates outcomes) — rose from 0.856 to 0.872. That is a small gain, the kind that is easy to dress up and shouldn't be.
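It helps to put those before-and-after figures side by side, as both absolute and relative change. The short Python sketch below uses only the numbers quoted above; the function name `reduction` is our own.

```python
def reduction(before: float, after: float) -> tuple:
    """Return (percentage-point drop, relative drop in percent)."""
    points = round(before - after, 1)
    relative = round(100 * (before - after) / before, 1)
    return points, relative


# Missingness before and after the cleanup, in percent, as quoted above.
tnm_points, tnm_relative = reduction(75.3, 35.7)
seer_points, seer_relative = reduction(24.3, 18.5)

# AUROC before and after, as quoted above.
auroc_gain = round(0.872 - 0.856, 3)

print(f"TNM missing:  -{tnm_points} points ({tnm_relative} % relative)")
print(f"SEER missing: -{seer_points} points ({seer_relative} % relative)")
print(f"AUROC gain:   +{auroc_gain}")
```

On these figures the process roughly halved missing TNM values and removed about a quarter of missing SEER values, while the AUROC moved by 0.016.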
The result worth dwelling on is not the AUROC. Before the cleanup, feature selection did not rank TNM stage among the model's important variables at all — the single most prognostic fact in colorectal cancer was invisible to the algorithm, because it was missing or wrong too often to be learnable. After the cleanup, TNM and its component T, N and M codes emerged as significant. The model had not been weak. It had been blind to the thing that matters, because the data did not reliably contain it.
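A toy simulation can show why a decisive variable drops out of sight when it is mostly missing and often miscoded. This is not the authors' feature-selection method: mutual information stands in here as a simple measure of how much a column tells you about the outcome. The data are synthetic, the stage-specific risks are invented, and values are assumed to go missing and be miscoded completely at random, which real records may not do.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys) -> float:
    """Mutual information in bits between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def simulate(n=20000, missing=0.75, miscoded=0.5, seed=0):
    """Compare the signal in a true stage column and a degraded copy."""
    rng = random.Random(seed)
    stages = (1, 2, 3, 4)
    # Invented risks of a poor outcome per stage; not taken from the study.
    risk = {1: 0.10, 2: 0.25, 3: 0.50, 4: 0.85}
    true_stage, recorded, outcome = [], [], []
    for _ in range(n):
        stage = rng.choice(stages)
        true_stage.append(stage)
        outcome.append(rng.random() < risk[stage])
        if rng.random() < missing:
            recorded.append(None)  # field never filled in
        elif rng.random() < miscoded:
            # A wrong stage, chosen at random among the other three.
            recorded.append(rng.choice([s for s in stages if s != stage]))
        else:
            recorded.append(stage)
    return (
        mutual_information(true_stage, outcome),
        mutual_information(recorded, outcome),
    )


mi_true, mi_recorded = simulate()
print(f"information in true stage:     {mi_true:.3f} bits")
print(f"information in recorded stage: {mi_recorded:.3f} bits")
```

Under these assumptions the recorded column carries only a small fraction of the information in the true stage, which is the sense in which a ranking procedure can pass over it.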
“A model can only learn from a variable that is reliably present and correct. The cleanup did not make the algorithm smarter; it made the most important fact visible to it for the first time.”
Where the claims stop
This is not a solved problem, and the authors do not pretend otherwise. After the full process, more than a third of TNM values are still missing — 35.7 percent is progress on 75.3, not a clean dataset. They are explicit that they offer no general remedy for data errors, and that staging in the real world is hard to reconstruct for clinical rather than clerical reasons: neoadjuvant treatment, surgical findings and multidisciplinary judgement can all leave the recorded stage genuinely ambiguous. Keyword rules tuned to one hospital's reporting style, in one language, carry no guarantee of travelling. And the study's endpoints are data completeness and a model metric — not a single patient outcome. A better-populated staging column is a precondition for trustworthy modelling; it is not evidence that anyone was treated better. That is the correct scope for a methods paper, and a reader should keep it there.
Why it matters here
When a predictive model disappoints, the reflex is to reach for a better model. This paper makes the patient, unglamorous case for looking first at the denominator: how completely and correctly the decisive variables were ever recorded. The Korean specifics do not transfer, but the structural problem does. Any institution building on retrospective records, German tumour registries included, for all their legal grounding, inherits the same gap between what a clinician wrote in a report and what a structured field actually holds. Getting the staging column right is not preparation for the analysis. On this evidence, it is most of the analysis.
Source: Park N, Na K, Sunwoo W, Baek JH, Lee Y, Lee S, Woo H. Process for Quality Management of Electronic Medical Records-Based Data: Case Study Using Real Colorectal Cancer Data. JMIR Medical Informatics 2025;13:e73884. A single-centre, retrospective case study; its endpoints are data completeness and a prognostic-model metric, not patient outcomes, and more than a third of staging values remain missing after the process.


