Journal Club · 27 June 2026 · 6 min read

Teaching a Language Model to Ask Before It Answers

A clinical model that gives a wrong answer confidently is dangerous, because people defer to confidence. This study tests a prompt that forces the model to ask first. The behaviour shift is dramatic; the evidence is a benchmark, not a bedside.

Dr. Sven Jungmann

CEO


Asked to comply with an illogical medical request, some large language models will do so up to 100 percent of the time — flattering the person at the keyboard rather than contradicting them. The same models show almost no difference in how confident they sound when they are right and when they are wrong. Put those two findings together and you have the real hazard of a medical chatbot: not that it errs, but that it errs in the assured tone people are trained to trust. Radiologists have followed incorrect machine suggestions against contradictory evidence; critical-care clinicians have deferred to systems their own intuition was warning them about. So the narrow, useful question is whether a model can be made to do the opposite of what it does by default — to say, out loud, when it does not yet know enough.

A group of academic researchers, writing in BMJ Health & Care Informatics in March 2026, set out to engineer exactly that behaviour. Their answer, BODHI — Balanced, Open-minded, Diagnostic, Humble and Inquisitive — is neither a new model nor a fine-tuned one. It is a way of prompting an existing model: a structured set of instructions that runs before the model speaks and makes it spell out what kind of problem it is facing, what it is uncertain about, and what it would have to ask before committing to an answer. Holding that modest scope in view matters, because it bounds what the results can mean.

The setup

BODHI works in two passes. The first is private: the model fills seven fixed fields — the task type, who is asking, its leading hypothesis and the reasoning behind it, the uncertainties that dent its confidence, one or two clarifying questions for any non-emergency case, red flags that should trigger escalation, and recommendations scaled to how unsure it is. The second pass writes the reply the clinician actually reads, conditioned on that first analysis and governed by what the authors call a Virtue Activation Matrix — in plain terms, rules for when the model should hedge or ask rather than assert. The whole thing is prompting scaffolding; no weights are changed.
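The two-pass structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names, the prompt wording, the one-line rule standing in for the Virtue Activation Matrix, and the `call_model` callable are all hypothetical.

```python
from dataclasses import asdict, dataclass, fields
from typing import Callable


@dataclass
class FirstPassAnalysis:
    # Illustrative field names mirroring the seven items listed above.
    task_type: str = ""
    asker: str = ""
    hypothesis_and_reasoning: str = ""
    uncertainties: str = ""
    clarifying_questions: str = ""
    red_flags: str = ""
    recommendations: str = ""


FIELD_NAMES = [f.name for f in fields(FirstPassAnalysis)]


def build_first_pass_prompt(case_text: str) -> str:
    """Ask the model to fill every field before it writes any reply."""
    header = "Do not answer yet. Fill in each field as 'name: value', one per line."
    return "\n".join([header, *FIELD_NAMES, "Case:", case_text])


def parse_first_pass(raw: str) -> FirstPassAnalysis:
    """Keep only lines whose key is one of the seven known fields."""
    values = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELD_NAMES:
            values[key.strip()] = value.strip()
    return FirstPassAnalysis(**values)


def build_second_pass_prompt(case_text: str, analysis: FirstPassAnalysis) -> str:
    """Condition the visible reply on the private analysis."""
    notes = "\n".join(f"{k}: {v}" for k, v in asdict(analysis).items())
    # Stand-in rule only; the paper's Virtue Activation Matrix is not reproduced here.
    rule = "If uncertainties are listed and no red flag applies, ask before asserting."
    return "\n".join(["Private analysis:", notes, rule, "Case:", case_text, "Reply:"])


def two_pass_reply(case_text: str, call_model: Callable[[str], str]) -> str:
    """Run both passes; call_model is any function mapping a prompt to text."""
    analysis = parse_first_pass(call_model(build_first_pass_prompt(case_text)))
    return call_model(build_second_pass_prompt(case_text, analysis))
```

Passing the model in as a plain callable keeps the scaffold independent of any vendor's client library, which fits the point that the technique lives entirely in the prompt and not in the weights.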

The test bed was HealthBench Hard, a curated set of 200 deliberately difficult clinical vignettes spanning emergency medicine, primary care and specialty consultations. Two models were run — GPT-4o-mini and GPT-4.1-mini, both from a single vendor — each across five random seeds, for 2,000 graded responses in total. No patient was involved and no decision was acted on. This is a controlled benchmark study of a prompting technique, and read at that altitude it is careful, well-reported work.

What the data show

The behavioural change is the genuine result, and it is unusually clean. On GPT-4.1-mini, the share of answers carrying a clarifying question went from 7.8 percent to 97.3 percent; on GPT-4o-mini, from zero to 73.5 percent. Hedging — explicit acknowledgement of uncertainty — rose by about twenty points on the stronger model. For a failure mode built on unearned confidence, converting a model that almost never pauses into one that almost always asks is the right reflex, and the prompt produced it reliably across all five seeds.

Overall response quality, scored against the benchmark's composite rubric, also rose — but here the two models diverge, and the gap is the part worth dwelling on. GPT-4.1-mini gained 16.6 percentage points (from 2.5 to 19.1 percent); GPT-4o-mini gained 2.2 (from 0.0 to 2.2 percent). The identical prompt that transformed one model barely registered on the other's quality score. A technique whose payoff swings this hard on which model carries it is one we do not yet understand well enough to generalise.
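The gains quoted above are percentage-point differences, not relative improvements. A short sketch of the arithmetic, using only the four scores reported in this section (the function names are illustrative), shows why that unit is the safer one here:

```python
# Overall-quality scores in percent, as reported above: (baseline, with prompt).
overall_quality = {
    "GPT-4.1-mini": (2.5, 19.1),
    "GPT-4o-mini": (0.0, 2.2),
}


def point_gain(before: float, after: float) -> float:
    """Absolute change in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float):
    """Change as a multiple of the baseline; undefined when the baseline is zero."""
    if before == 0:
        return None
    return round((after - before) / before, 2)


gains = {model: point_gain(*scores) for model, scores in overall_quality.items()}

for model, (before, after) in overall_quality.items():
    print(model, gains[model], relative_gain(before, after))
```

Because one baseline is 0.0 percent, a relative improvement cannot be computed for that model at all, so percentage points are the only unit in which the two models can be compared directly.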

The number the abstract leaves out

There is a counter-current the headline does not advertise, and the authors are candid about it in their results table. As the models started asking and hedging, their communication-quality score fell: by roughly 12.5 percentage points on GPT-4.1-mini (from 70.1 to 57.5 percent) and 11.3 on GPT-4o-mini. The authors read this as an artefact: the rubric was written to reward confident, declarative answers, so a reply that asks a question or flags a doubt scores worse even when it is the safer reply. That is a plausible and honest reading. It also means the same intervention pushes two of the study's own outcomes in opposite directions (overall quality up, communication quality down), which is precisely why a single rubric score cannot settle whether the patient is better served.

The effect sizes invite caution rather than applause. A Cohen's d of 11.56 on overall quality, or 16 and 19 on context-seeking, is not the signature of a subtle clinical signal; it is what you get when a prompt switches a behaviour almost fully on. That is consistent with the honest reading — the instruction forces the behaviour — but it should stop anyone from mistaking these numbers for a measure of better medicine. They measure compliance with an instruction, on a scale the same group helped define, across two models from one maker. The authors name the limits themselves: a single benchmark, two model families from one provider, no clinician in the loop, and chain-of-thought text that may not faithfully reflect what the model actually computed.
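Why a forced behaviour produces such large values follows from how Cohen's d is defined: a difference in means divided by a pooled standard deviation. The sketch below uses the standard pooled-variance formula with made-up numbers, not data from the paper, and it does not assume anything about how the authors pooled their own runs.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(control, treated):
    """Difference in means divided by the pooled sample standard deviation."""
    n0, n1 = len(control), len(treated)
    pooled_var = ((n0 - 1) * variance(control) + (n1 - 1) * variance(treated)) / (n0 + n1 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


# Hypothetical rates in percent, NOT taken from the study.
# Both pairs shift by the same 20 points; only the run-to-run spread differs.
noisy_control = [10.0, 30.0, 50.0, 20.0, 40.0]
noisy_treated = [30.0, 50.0, 70.0, 40.0, 60.0]
steady_control = [29.0, 30.0, 31.0, 30.0, 30.0]
steady_treated = [49.0, 50.0, 51.0, 50.0, 50.0]

d_noisy = cohens_d(noisy_control, noisy_treated)
d_steady = cohens_d(steady_control, steady_treated)
print(f"noisy runs: d = {d_noisy:.2f}; steady runs: d = {d_steady:.2f}")
```

The same 20-point shift yields a far larger d when the runs barely vary, which is the pattern to expect when an instruction switches a behaviour on almost every time. A large d of this kind reflects consistency of compliance, not the size of any clinical benefit.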

“Teaching a model to ask first is the right instinct against overconfidence. It is not yet evidence that the patient is better off.”

What a clinical leader should take from it

The endpoint here is a rubric score on written answers to vignettes — not a diagnosis acted on, not a test ordered or spared, not a patient followed forward. A model that asks more and hedges more is plausibly safer; demonstrated to be safer it is not, and on a benchmark it can look better simply by giving the graders what they reward. The useful takeaway is therefore not a framework to memorise. It is that how a system is prompted — before any retraining, before any new model — measurably decides whether it volunteers its uncertainty or buries it, and that this lever is cheap to pull and worth writing into a specification. For European systems weighing such tools under the Medizinprodukteverordnung (MDR) and the EU AI Act, the implication is procurement-shaped: ask not only how accurate a model is, but how it behaves when it is unsure — and insist on seeing that behaviour tested against something closer to your own patients than a static, English-language benchmark. A model that knows when to ask is a precondition for safe assistance, not proof of it.

Source: Arslan J, Benke K, Cajas Ordones SA, et al. Engineering framework for curiosity-driven and humble AI in clinical decision support. BMJ Health & Care Informatics 2026;33(1):e101877. A peer-reviewed evaluation of a prompting technique on a single static benchmark using two models from one vendor; its endpoints were rubric scores and response behaviour, not patient outcomes. Full text read via the open PubMed Central mirror.

#Journal Club #Clinical AI #Large Language Models #Patient Safety #Evidence-Based Medicine

