AI Agents Arrived Before the Evidence Did
At HIMSS 2026 the vendors shipped clinical AI agents faster than anyone could count them. A conference dispatch is not a study — but it shows exactly where the validation gap sits, and how a single word keeps the proof bar low.

Dr. Sven Jungmann
CEO

Count the agents launched at one health-IT conference and you run out of fingers before you run out of booths. That is the quiet finding in Casey Ross's dispatch from the HIMSS 2026 floor: Epic alone showed three — Art to draft documentation, Penny to chase billing and coverage denials, Emmie to field patient questions and book appointments. Oracle previewed an assistant meant to draft notes and suggest next steps across thirty specialties. Amazon, Google and Microsoft each added personas of their own. A trade show is not evidence, and reading it as such would be a category error. But it is a record of what the market has decided it can ship — and of the proof it has decided it can ship without.
Read it as reporting, not a result
So take the piece for what it is: reporting by an investigative journalist who has covered health AI for years, not a peer-reviewed finding. Its value is not a number; most of the article sits behind a paywall, and nothing worth keeping depends on the part I cannot read. The value is a structural observation that any careful reader can hold up against the regulatory map — and the subhead states it without ornament: patients are seldom consulted on how these tools are developed and tested.
What the dispatch documents is a shift in kind, not just in volume. The single-purpose model that classifies an image gives way to the orchestrated system that drafts a note, queries a record, flags a denial and routes a message — software that acts rather than merely predicts. What it does not document, and does not pretend to, is any prospective test of these agents against patient outcomes, or any sign that the buyers at the booths asked to see one.
One word does the regulatory work
The sharper question is not whether these tools are regulated but which ones slip the net by description. In the United States, software meets the Software as a Medical Device (SaMD) definition, and earns the pre-market scrutiny that follows, only when its intended use is to diagnose, cure, mitigate, treat or prevent disease. Label a tool administrative or operational and it falls outside that line, and outside any duty to show clinical data before deployment. Reasonable enough for a scheduling assistant. Less reasonable when the same vocabulary stretches to cover a system that drafts the clinical note a physician will sign, or proposes the next step in care. Intended use is, in the first instance, whatever the manufacturer claims it to be. The label, not the function, sets the level of proof, and the seller picks the label.
Europe holds no clear advantage. The EU Medical Device Regulation (MDR) runs along the same definitional seam, and the EU AI Act lays a risk-based regime over the top without settling when an assistant becomes a device. An agent filed under "administrative" can occupy the same grey zone on either side of the Atlantic.
Why scale should raise the bar, not lower it
There is a reason the unit of deployment matters. The reach of a single clinician's error is bounded by how many patients that clinician sees. An agent's is bounded by nothing local: the same blind spot, the same quiet omission, reproduces across every interaction, thousands a day, in one identical voice. That is not an argument against the technology. It is an argument that the evidentiary bar should rise, not fall, as the thing being deployed moves from a person to a system speaking to an entire population at once.
What discipline looks like when someone bothers
It helps to set a counter-example beside the trade-show noise, precisely because it is the exception. A hepatology decision-support system published in Frontiers in Medicine in December 2025 was built to test whether the information it had retrieved was good enough to answer on — a primary and a secondary model each voting independently, with a non-model relevance score breaking ties, before the system either answered or went back to refine the query. Two hepatologists scored its safety at 4.9 against 4.1 for standalone GPT-4 on a five-point scale. The caveat the authors own up to: those scores come from thirty simulated clinical questions, not patients, and the test set is small — so the result speaks to design discipline, not bedside performance. But that instinct, building a system that checks whether it should answer before it does, is exactly the engineering a launch optimised for capability has no reason to reward. The distance between the two is the whole story.
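For readers who want the shape of that check in concrete terms, here is a minimal sketch in Python. It is not the published system's code. Every name in it (evidence_is_sufficient, answer_or_refine, token_overlap, the 0.5 threshold, the three-round cap) is a hypothetical stand-in, the word-overlap score is only one possible non-model relevance measure, and returning None to abstain after the last round is my assumption; the description above says only that the system either answers or goes back to refine the query.

```python
from typing import Callable, List, Optional

# Hypothetical types: a judge votes on whether the retrieved passages suffice.
Judge = Callable[[str, List[str]], bool]
Scorer = Callable[[str, List[str]], float]


def token_overlap(question: str, passages: List[str]) -> float:
    """Non-model relevance score: share of question words found in the passages."""
    q_words = set(question.lower().split())
    if not q_words:
        return 0.0
    p_words = set(" ".join(passages).lower().split())
    return len(q_words & p_words) / len(q_words)


def evidence_is_sufficient(question: str, passages: List[str],
                           primary: Judge, secondary: Judge,
                           relevance: Scorer = token_overlap,
                           threshold: float = 0.5) -> bool:
    """Two independent votes; the relevance score is consulted only on a split."""
    vote_a = primary(question, passages)
    vote_b = secondary(question, passages)
    if vote_a == vote_b:
        return vote_a
    return relevance(question, passages) >= threshold


def answer_or_refine(question: str,
                     retrieve: Callable[[str], List[str]],
                     refine: Callable[[str, List[str]], str],
                     answer: Callable[[str, List[str]], str],
                     primary: Judge, secondary: Judge,
                     relevance: Scorer = token_overlap,
                     threshold: float = 0.5,
                     max_rounds: int = 3) -> Optional[str]:
    """Answer only when the evidence passes the gate; otherwise refine and retry."""
    query = question
    for _ in range(max_rounds):
        passages = retrieve(query)
        if evidence_is_sufficient(question, passages, primary, secondary,
                                  relevance, threshold):
            return answer(question, passages)
        query = refine(query, passages)
    return None  # assumption: abstain rather than answer on weak evidence


if __name__ == "__main__":
    # Stub components standing in for real models and a real retriever.
    has_passages = lambda q, p: len(p) > 0
    result = answer_or_refine(
        "ascites management",
        retrieve=lambda query: ["guideline text"] if "guideline" in query else [],
        refine=lambda query, passages: query + " guideline",
        answer=lambda q, p: f"answered from {len(p)} passage(s)",
        primary=has_passages,
        secondary=has_passages,
    )
    print(result)
```

The design point is the order of operations: the generation step is unreachable until the sufficiency gate has passed, so the default on weak evidence is another retrieval round, not a fluent guess.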
Source: Ross C. AI agents are rapidly spreading in health care, but validation is lacking. STAT News, 11 March 2026. This is journalism — a reported dispatch from a conference floor, largely paywalled, not primary research; its worth is the structural observation, which holds independently of any single product claim.


