Journal Club · 10 June 2026 · 5 min read

Typos, Homophones, Redactions: Which Kind of Messy Clinical Text Actually Breaks a Language Model

A benchmark fed three language models deliberately corrupted medical text. Misspellings and sound-alikes barely dented them; blacking words out did the damage. Useful — but it is synthetic noise on a test set, not evidence from a ward.

Dr. Sven Jungmann

CEO

Run the experiment in your head before reading the paper. You corrupt a few hundred clinical inputs three different ways — scramble characters, swap in sound-alike words, black out words entirely — feed them to three language models, and ask how often each model falls apart. Which corruption does the most harm? Most clinicians I ask guess typos. The honest answer, from a benchmark in JMIR AI, is the one that gets least attention in deployment: redaction, the very step we add to protect patients.

The question is not academic. A model in a clinic almost never meets clean prose. It meets abbreviations, autocorrect scars, copy-paste debris, and — more and more — text that a privacy step has stripped of names, dates and identifiers before the model sees a single token. If robustness to that mess is assumed rather than measured, the failure is quiet: a confident, wrong answer on a sentence a human would have squinted at.

The design, stated plainly

Three models were put on the bench: a GPT model, Meta's Llama, and BlueBERT, an encoder model pretrained on biomedical text. Each ran three tasks: sentiment classification, classifying the condition described in a medical abstract, and answering questions about clinical notes. The inputs were then degraded three ways at graded intensities: character-level typos (at 10, 30 and 50 percent of characters), homophone substitutions (10, 20 and 30 percent of words), and redactions (10, 30 and 50 percent of words). In all, the grid comes to 270 experimental scenarios. None of this is a clinical trial: there are no patients and no outcomes, and the noise is synthetic, applied by a script rather than produced on a real ward.
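To make the three perturbations concrete, here is a minimal Python sketch of what such a script might look like. It is not the authors' code: the function names, the tiny homophone table, the `[REDACTED]` mask and the rounding rule are all assumptions made for illustration.

```python
import random
import string

# Hypothetical homophone table for illustration; the paper's word list is not reproduced here.
HOMOPHONES = {"heel": "heal", "heal": "heel", "week": "weak", "weak": "week", "pain": "pane"}

def add_typos(text, rate, rng):
    """Replace about `rate` of the letters with a different lowercase letter."""
    chars = list(text)
    letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
    k = round(rate * len(letter_positions))
    for i in rng.sample(letter_positions, k):
        options = [c for c in string.ascii_lowercase if c != chars[i].lower()]
        chars[i] = rng.choice(options)
    return "".join(chars)

def swap_homophones(text, rate, rng):
    """Swap up to `rate` of the words for a sound-alike, where the table has one."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in HOMOPHONES]
    k = min(len(candidates), round(rate * len(words)))
    for i in rng.sample(candidates, k):
        words[i] = HOMOPHONES[words[i].lower()]
    return " ".join(words)

def redact(text, rate, rng, mask="[REDACTED]"):
    """Replace about `rate` of the words, chosen at random, with a mask token."""
    words = text.split()
    k = round(rate * len(words))
    for i in rng.sample(range(len(words)), k):
        words[i] = mask
    return " ".join(words)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be repeated
    note = "patient reports weak pulse and heel pain for one week"
    print(add_typos(note, 0.10, rng))
    print(swap_homophones(note, 0.20, rng))
    print(redact(note, 0.30, rng))
```

Note that the redaction here picks words at random, which is exactly why it can hit a clinical term as easily as an identifier.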

What held up, and what did not

The first surprise is how little broke. Across the 270 scenarios, performance stayed stable in 113 (41.85 percent) and actually improved in 38 (14.07 percent), so in 151 of them, more than half, degrading the input left the model either no worse off or better. Performance fell in 104 (38.52 percent), and only 15 (5.56 percent) collapsed into what the authors call a catastrophic drop. The reflexive fear, that one stray typo derails the whole thing, is simply not what the numbers say.
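The arithmetic is easy to check. The short Python snippet below recomputes the shares from the four counts reported above; the category labels are shorthand chosen here, not the authors' terms.

```python
# Outcome counts as reported for the 270 scenarios.
counts = {"stable": 113, "improved": 38, "declined": 104, "catastrophic": 15}

total = sum(counts.values())

# Share of each outcome as a percentage, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Scenarios where the model ended up no worse off, or better.
no_worse = counts["stable"] + counts["improved"]

if __name__ == "__main__":
    print(total, shares, no_worse, round(100 * no_worse / total, 2))
```

The four counts sum to 270, so the categories are mutually exclusive: the 15 catastrophic drops are counted separately from the 104 ordinary declines.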

The second surprise is the ranking of harm. The authors report that typographical errors and homophone substitutions had relatively limited impact, while redaction produced a far more pronounced degradation. The catastrophic failures cluster there too: the authors' per-type tally concentrates the catastrophic outcomes under redaction rather than under typos or homophones. The logic is plain once said aloud. A misspelling still carries a signal ("Diabtes" sits one edit from its meaning), while a deleted word carries none. A model can repair a distorted clue; it cannot recover a clue that is gone.
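"One edit" can be read literally as Levenshtein distance, the minimum number of single-character insertions, deletions or substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version written for this article; it illustrates the intuition only and is not something the study itself computes.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(edit_distance("Diabtes", "Diabetes"))
```

A mask token, by contrast, has no useful distance to the word it replaced: the same token stands in for every deleted word.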

A third result is worth keeping because it cuts against intuition: corruption is not monotonically bad. In one worked case, an abstract on carcinoma of the gallbladder — ground truth "Digestive System Disease" — was misclassified as "Neoplasm" when left untouched; perturbing the words "carcinoma" and "cancer" pushed the model off that wrong label and onto the right one. The noise removed a term that had been steering the model astray. Small, but it dents the assumption that cleaner input is always the safer input.

What the study does not license

Read it as the benchmark it is. The perturbations are synthetic — the authors flag this first among their limitations — and a routine that deletes random words is not a real pseudonymisation pipeline, a garbled dictation, or a patient's own free-text message. The outcome measured is task accuracy on text, not a decision about a patient, and nothing was scored against a clinical endpoint. "The answer got worse" is a property of the test set; it is not yet a documented harm to anyone.

The boundaries that matter for generalising are tight: three models, three tasks, English only, one purpose-built dataset. And the headline-friendly detail — that 12 of the 15 catastrophic drops came from the GPT model — should be handled with care. It invites a neat story (the strongest model is the most brittle), but fifteen events scattered across a 270-cell grid is thin ground for a verdict; treat it as a hypothesis. This is no knock on the work, which is peer-reviewed and names its own limits squarely. It is a caution that the headline not outrun the method.

The one inference that travels

What does carry beyond the test set is the tension the study makes concrete. The data-protection reflex is to strip identifying text before a model touches it — sound practice under the General Data Protection Regulation (GDPR). But redaction is precisely the perturbation under which these models faltered most. When that step removes load-bearing clinical words rather than only identifiers, it can be the exact condition in which the model is least reliable. That is not an argument against redaction. It is an argument for testing a system on the degraded input it will actually receive, redaction included, before trusting what comes out. Modest, and worth holding onto: measure robustness against real-world mess; do not assume it.
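In practice, "measure, do not assume" can be as simple as scoring the same labelled examples twice, once clean and once after the preprocessing the system will really apply. The Python sketch below shows the shape of such a check. Everything in it is a stand-in: `toy_predict` is a keyword rule rather than a language model, and `over_redact` imitates a privacy step that wrongly removes a clinical term.

```python
def accuracy(predict, examples):
    """Fraction of (text, label) pairs the predictor gets right."""
    return sum(predict(text) == label for text, label in examples) / len(examples)

def robustness_gap(predict, examples, perturb):
    """Clean accuracy minus accuracy on perturbed copies of the same inputs."""
    degraded = [(perturb(text), label) for text, label in examples]
    return accuracy(predict, examples) - accuracy(predict, degraded)

def toy_predict(text):
    """Hypothetical stand-in for a model: a single keyword rule."""
    return "diabetes" if "insulin" in text.lower() else "other"

def over_redact(text):
    """Hypothetical redaction step that strips a load-bearing clinical word."""
    return text.replace("insulin", "[REDACTED]")

examples = [
    ("started insulin last month", "diabetes"),
    ("sprained ankle while running", "other"),
]

gap = robustness_gap(toy_predict, examples, over_redact)

if __name__ == "__main__":
    print(f"accuracy lost to redaction: {gap:.2f}")
```

In a real evaluation the `predict` and `perturb` arguments would be the deployed model and the actual pseudonymisation pipeline, and the examples would come from representative local text.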

Source: Joshi S, Mehta M, Maniar S, Wang M, Singh VK. Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation. JMIR AI 2026;5:e83640 (published 20 February 2026). A peer-reviewed benchmark on synthetically perturbed text across three models and three tasks — useful as a robustness probe, but with no patients, no clinical outcomes and no claim to bedside performance. Funded in part by the Rutgers University School of Communication and Information; no conflicts of interest declared.

This analysis comes from the people behind Visite.

Want to see this in your hospital?