Journal Club · 5 min read

Typos, Homophones, Redactions: Which Kind of Messy Clinical Text Actually Breaks a Language Model

A benchmark fed three language models deliberately corrupted medical text. Misspellings and sound-alikes barely dented them; blacking words out did the damage. Useful — but it is synthetic noise on a test set, not evidence from a ward.

Dr. Sven Jungmann

CEO

Run the experiment in your head before reading the paper. You corrupt a few hundred clinical inputs three different ways (scramble characters, swap in sound-alike words, black out words entirely), feed them to three language models, and ask how often each model falls apart. Which corruption does the most harm? Most clinicians I ask guess typos. The honest answer, from a benchmark in JMIR AI, is the one that gets the least attention in deployment: redaction, the very step we add to protect patients.

The question is not academic. A model in a clinic almost never meets clean prose. It meets abbreviations, autocorrect scars, copy-paste debris, and — more and more — text that a privacy step has stripped of names, dates and identifiers before the model sees a single token. If robustness to that mess is assumed rather than measured, the failure is quiet: a confident, wrong answer on a sentence a human would have squinted at.

The design, stated plainly

Three models were put on the bench: a GPT model, Meta's Llama, and BlueBERT, an encoder model pretrained on biomedical text. Each ran three tasks: sentiment classification, classifying the condition described in a medical abstract, and answering questions about clinical notes. The inputs were then degraded three ways at graded intensities: character-level typos (at 10, 30 and 50 percent of characters), homophone substitutions (10, 20, 30 percent of words), and redactions (10, 30, 50 percent of words). In all, the grid comes to 270 experimental scenarios. None of this is a clinical trial: no patients, no outcomes, and the noise is synthetic, applied by a script rather than arising on a real ward.
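
To make the three perturbation types concrete, here is a minimal Python sketch of what such a script could look like. It is not the authors' code: the function names, the tiny HOMOPHONES table, the "[REDACTED]" mask and the sampling strategy are all assumptions made for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical sound-alike table for illustration only;
# the benchmark's real homophone list is not reproduced here.
HOMOPHONES = {"heal": "heel", "pain": "pane", "weak": "week", "vein": "vain"}


def _pick(n_items, rate, rng):
    """Return a set of indices covering roughly `rate` of n_items."""
    k = round(n_items * rate)
    return set(rng.sample(range(n_items), k))


def add_typos(text, rate, rng):
    """Replace about `rate` of the alphabetic characters with a different letter."""
    chars = list(text)
    letters = [i for i, c in enumerate(chars) if c.isalpha()]
    for j in _pick(len(letters), rate, rng):
        i = letters[j]
        options = [c for c in ALPHABET if c != chars[i].lower()]
        chars[i] = rng.choice(options)
    return "".join(chars)


def swap_homophones(text, rate, rng):
    """Swap words for a sound-alike, aiming at `rate` of all words.

    Assumption: only words found in HOMOPHONES (exact lowercase match,
    no attached punctuation) can be swapped, so the achieved rate is
    capped by how many such words the text contains.
    """
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in HOMOPHONES]
    k = min(len(candidates), round(len(words) * rate))
    for i in rng.sample(candidates, k):
        words[i] = HOMOPHONES[words[i].lower()]
    return " ".join(words)


def redact(text, rate, rng, mask="[REDACTED]"):
    """Replace about `rate` of the words with a mask token."""
    words = text.split()
    for i in _pick(len(words), rate, rng):
        words[i] = mask
    return " ".join(words)


# Small demonstration with a fixed seed so runs are repeatable.
rng = random.Random(42)
note = "Patient reports pain in the left leg and feels weak after surgery"
for name, fn in [("typos", add_typos), ("homophones", swap_homophones), ("redaction", redact)]:
    print(name, "->", fn(note, 0.3, rng))
```

The design point the sketch makes visible is the one the results turn on: the first two functions distort a word but leave something in its place, while the third removes it outright.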

What held up, and what did not

The first surprise is how little broke. Across the 270 scenarios, performance stayed stable in 113 (41.85 percent) and actually improved in 38 (14.07 percent) — so in 151 of them, more than half, degrading the input left the model no worse off or better. Performance fell in 104 (38.52 percent), and only 15 (5.56 percent) collapsed into what the authors call a catastrophic drop. The reflexive fear — one stray typo derails the whole thing — is simply not what the numbers say.
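
The percentages above follow directly from the four counts. A short Python check, using only the figures quoted in this article (the dictionary keys are labels chosen here, not the authors' terms):

```python
# Outcome counts as quoted in the article; key names are illustrative labels.
counts = {"stable": 113, "improved": 38, "declined": 104, "catastrophic": 15}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
no_worse = counts["stable"] + counts["improved"]

print(total)                              # 270
print(shares)                             # stable 41.85, improved 14.07, declined 38.52, catastrophic 5.56
print(no_worse, round(100 * no_worse / total, 2))  # 151 55.93
```

So the "more than half" claim rests on 151 of 270 scenarios, or just under 56 percent.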

The second surprise is the ranking of harm. The authors report that typographical errors and homophone substitutions had relatively limited impact, while redaction produced a far more pronounced degradation, and the catastrophic failures cluster there: in their per-type tally, redaction accounted for the catastrophic outcomes that the other two perturbations did not. The logic is plain once said aloud. A misspelling still carries a signal — "Diabtes" sits one edit from its meaning — while a deleted word carries none. A model can repair a distorted clue; it cannot recover a clue that is gone.

A third result is worth keeping because it cuts against intuition: corruption is not monotonically bad. In one worked case, an abstract on carcinoma of the gallbladder — ground truth "Digestive System Disease" — was misclassified as "Neoplasm" when left untouched; perturbing the words "carcinoma" and "cancer" pushed the model off that wrong label and onto the right one. The noise removed a term that had been steering the model astray. Small, but it dents the assumption that cleaner input is always the safer input.

What the study does not license

Read it as the benchmark it is. The perturbations are synthetic — the authors flag this first among their limitations — and a routine that deletes random words is not a real pseudonymisation pipeline, a garbled dictation, or a patient's own free-text message. The outcome measured is task accuracy on text, not a decision about a patient, and nothing was scored against a clinical endpoint. "The answer got worse" is a property of the test set; it is not yet a documented harm to anyone.

The boundaries that matter for generalising are tight: three models, three tasks, English only, one purpose-built dataset. And the headline-friendly detail — that 12 of the 15 catastrophic drops came from the GPT model — should be handled with care. It invites a neat story (the strongest model is the most brittle), but fifteen events scattered across a 270-cell grid is thin ground for a verdict; treat it as a hypothesis. This is no knock on the work, which is peer-reviewed and names its own limits squarely. It is a caution that the headline not outrun the method.

The one inference that travels

What does carry beyond the test set is the tension the study makes concrete. The data-protection reflex is to strip identifying text before a model touches it — sound practice under the General Data Protection Regulation (GDPR). But redaction is precisely the perturbation under which these models faltered most. When that step removes load-bearing clinical words rather than only identifiers, it can be the exact condition in which the model is least reliable. That is not an argument against redaction. It is an argument for testing a system on the degraded input it will actually receive, redaction included, before trusting what comes out. Modest, and worth holding onto: measure robustness against real-world mess; do not assume it.
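
In practice, "testing on the degraded input" can be as simple as scoring the same labelled examples twice, once clean and once after the redaction step, and looking at the gap. The sketch below assumes a model exposed as a plain function from text to label; toy_model and toy_redactor are hypothetical stand-ins, not a real classifier or a real pseudonymisation pipeline.

```python
from typing import Callable, Sequence, Tuple


def accuracy(model: Callable[[str], str],
             texts: Sequence[str],
             labels: Sequence[str]) -> float:
    """Share of texts for which the model returns the expected label."""
    hits = sum(model(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


def robustness_gap(model: Callable[[str], str],
                   transform: Callable[[str], str],
                   texts: Sequence[str],
                   labels: Sequence[str]) -> Tuple[float, float, float]:
    """Return (clean accuracy, degraded accuracy, difference)."""
    clean = accuracy(model, texts, labels)
    degraded = accuracy(model, [transform(t) for t in texts], labels)
    return clean, degraded, clean - degraded


# Hypothetical stand-in for a deployed model: a keyword rule.
def toy_model(text: str) -> str:
    return "diabetes" if "insulin" in text.lower() else "other"


# Hypothetical over-eager redactor that masks a clinical term,
# i.e. the failure mode the paragraph above warns about.
def toy_redactor(text: str) -> str:
    return text.replace("insulin", "[REDACTED]")


texts = ["Patient started on insulin last week.", "Follow-up for ankle sprain."]
labels = ["diabetes", "other"]

clean, degraded, gap = robustness_gap(toy_model, toy_redactor, texts, labels)
print(clean, degraded, gap)  # 1.0 0.5 0.5
```

With a real system, the transform would be the actual privacy step in the pipeline and the examples would come from the setting in which the model is meant to run; the two-pass comparison stays the same.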

Source: Joshi S, Mehta M, Maniar S, Wang M, Singh VK. Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation. JMIR AI 2026;5:e83640 (published 20 February 2026). A peer-reviewed benchmark on synthetically perturbed text across three models and three tasks — useful as a robustness probe, but with no patients, no clinical outcomes and no claim to bedside performance. Funded in part by the Rutgers University School of Communication and Information; no conflicts of interest declared.
