Journal Club · 6 min read

People Come to Be Heard. Most Chatbots Reply With a List.

Three in four people who told a chatbot they felt low were not asking for advice — they were asking to be heard. A formative study of eight commercial systems shows most answered with information instead, and names the gap precisely.

Dr. Sven Jungmann

CEO

Give eight popular assistants the same handful of sentences about feeling depressed, and they sort themselves into two camps. A social chatbot built to chat back answers with something that reads like sympathy three times out of four. Ask a voice assistant or a general-purpose model the same thing, and you mostly get information: a search result, or a tidy paragraph about breathing exercises. That split is the spine of a paper by Chin and colleagues, and it matters because of what people were doing when they reached for these systems in the first place — which, overwhelmingly, was not asking for advice.

The study was published in JMIR Formative Research in November 2025, and its evidence tier deserves a sentence up front, because the tier sets the ceiling on what we may conclude. This is qualitative, formative work: human coders reading and categorising real conversations, plus a small head-to-head of how eight commercial systems answer depression-related prompts. No trial, no clinical outcome, no follow-up. It is not evidence that one design helps and another harms. It is a careful map of a mismatch, and the map is worth having.

What people actually brought

The demand side is the firmer of the two findings, and it is striking. Of the depression-related messages the coders classified, 75.3 percent (3,067 of 4,073) expressed feelings rather than sought anything. People wrote some version of I feel sad, I'm depressed, I feel alone. Only 4.1 percent (168) asked for a coping strategy; another 5.8 percent named isolation and loneliness directly. The overwhelming move was disclosure, not a request for a solution. People came to be heard.

Those messages came from SimSimi, a social chatbot the paper describes as having more than 400 million users, and were drawn from five English-speaking countries (Canada, Malaysia, the Philippines, the United Kingdom, the United States) between 2016 and 2021. The corpus was large: 13,700 utterances, half of them user messages and half replies. One researcher and five coders with a medical background sorted the messages against an established help-seeking framework and the replies into therapeutic communication styles, with strong inter-rater agreement (Fleiss' kappa of 0.87 and 0.89). The percentages above do not rest on a small sample.
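For readers who have not met Fleiss' kappa: it compares how often several raters put an item in the same category with how often they would do so by chance. What follows is a minimal Python sketch under stated assumptions. The table is a toy example with five hypothetical raters and three invented categories, none of the numbers come from the study, and `fleiss_kappa` is an illustrative helper rather than the authors' code.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] is the number of
    raters who assigned item i to category j. Every item must be
    rated by the same number of raters."""
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    n_categories = len(table[0])
    total_ratings = n_items * n_raters

    # Share of all ratings that fell into each category.
    category_share = [
        sum(row[j] for row in table) / total_ratings
        for j in range(n_categories)
    ]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    item_agreement = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in table
    ]

    observed = sum(item_agreement) / n_items           # mean observed agreement
    expected = sum(share * share for share in category_share)  # chance agreement
    return (observed - expected) / (1 - expected)


# Toy data (invented): six messages, five raters, three categories.
toy_ratings = [
    [5, 0, 0],
    [5, 0, 0],
    [0, 5, 0],
    [0, 0, 5],
    [4, 1, 0],
    [0, 4, 1],
]

toy_kappa = fleiss_kappa(toy_ratings)
print(f"Fleiss' kappa on the toy table: {toy_kappa:.3f}")
```

On this toy table, two split decisions out of six are enough to pull kappa below 0.8, which gives some sense of how much agreement values of 0.87 and 0.89 imply.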

Who answers in kind, and who changes the subject

SimSimi itself answered in a way coded as therapeutic 77.7 percent of the time (2,417 of 3,108 relevant replies) — empathy in 29 percent of cases, active listening in 26.9 percent, open-ended questions in 21.8 percent. In the second part of the study, the team put the same kinds of prompts to eight systems — Alexa, Google Assistant, Siri, ChatGPT, Replika, Woebot, Wysa and SimSimi — using 45 standardised queries, and the field pulled apart. SimSimi and Replika still leaned warm; Replika, a companion app, returned an empathetic reply in more than three quarters of cases (28 of 36).

Everyone else answered a statement of distress by handing over information. The voice assistants returned literal search results — Alexa in 88.2 percent of cases, Google Assistant in 60 percent, Siri in 55.6 percent. ChatGPT coded as providing solutions 95.2 percent of the time, typically a long, well-meaning paragraph recommending yoga, deep breathing or meditation. Woebot, a mental-health chatbot, answered almost entirely with clarification questions (97.3 percent). None of this is malfunction. A search engine searching, a chatbot clarifying — each is doing its job. The job simply isn't the one a person who has just written I feel alone is asking for.
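The counts quoted in this article can be checked against their percentages in a few lines. The sketch below uses only the numerators and denominators given above; the labels and the helper name `percent` are ours, chosen for illustration.

```python
def percent(part, whole):
    """Return part/whole as a percentage (unrounded)."""
    return 100 * part / whole


# (label, numerator, denominator, percentage as quoted in this article)
QUOTED = [
    ("user messages expressing feelings", 3067, 4073, 75.3),
    ("user messages asking for a coping strategy", 168, 4073, 4.1),
    ("SimSimi replies coded as therapeutic", 2417, 3108, 77.7),
]

for label, part, whole, quoted in QUOTED:
    computed = round(percent(part, whole), 1)
    status = "matches" if computed == quoted else "differs from"
    print(f"{label}: {computed} {status} the quoted {quoted}")

# Replika's 28 of 36 is described only as "more than three quarters".
print(f"Replika empathetic replies: {round(percent(28, 36), 1)} percent")
```

Two of the three figures match to one decimal place. The third, 2,417 of 3,108, rounds to 77.8 rather than 77.7; one possible explanation, and it is only a guess, is that 77.7 is the sum of the three rounded style shares (29 + 26.9 + 21.8). A tenth of a point changes nothing in the argument.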

What the percentages cannot tell you

Here the design has to discipline the reading. These are categorisations of conversational style, not measurements of benefit. The coders judged whether a reply looked empathetic or informative; no one measured whether the empathetic replies left anyone better off: less depressed, more likely to seek real help, safer. A warmer answer is plausibly kinder. The study cannot tell us it is more helpful, and it would be a mistake to read the empathy percentages as a clinical ranking. That Replika scores high on warmth says nothing about whether it is a safe place to bring a worsening mood.

The other limits are wide, and the authors name them. The corpus runs only to 2021, so it largely predates today's most capable models — the field has shifted under the paper since the data were collected. It is English-only, built mainly on a single chatbot, and confined to single-turn exchanges, which cannot capture how a hard conversation unspools over many turns. The second study rests on a small set of prompts, and well over half of those responses were, by the authors' own account, contextually disconnected. That candour is to the paper's credit, and a reason to treat the system-by-system figures as illustrative rather than as a league table.
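To see why small per-system counts are better read as illustrative, it helps to put an uncertainty band around one of them. The sketch below computes a 95 percent Wilson score interval for Replika's 28 empathetic replies out of 36. This is our own back-of-the-envelope calculation, not a figure from the paper, and it assumes the 36 replies can be treated as independent trials, which is a simplification.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.
    z=1.96 corresponds to roughly 95 percent coverage."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Count quoted above: 28 empathetic replies out of 36 (assumed independent).
low, high = wilson_interval(28, 36)
print(f"28 of 36: {28 / 36:.1%}, interval {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from roughly 62 to 88 percent, wide enough that differences of ten or fifteen points between systems should not be over-read.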

The line that isn't a design question

The authors do not let the reader forget that people sometimes bring more than sadness to these systems. In their discussion they cite a reported case of a user who took their own life after a six-week conversation with a chatbot. Whatever the precise causal story, the implication is not subtle. A system that invites people to confide, yet knows neither the limits of its own competence nor how to route a person to real help, is carrying a responsibility it was never built for. The authors' recommendation is the sober one: systems used this way should be designed with clinicians, should respond to risk signals, and should be honest with users about what they are and are not.

Why it matters here

No European health system is going to deploy SimSimi. The structural point survives the move regardless. People in distress reach for whatever is available and unjudging, and increasingly that is a general-purpose assistant optimised to be informative rather than present. As these tools edge toward triage and self-help — including inside software formally regulated under the Medical Device Regulation (MDR) and the EU AI Act (EU-KI-Verordnung) — the question this study sharpens is not whether a system can speak warmly. It is whether the thing a frightened person actually needs is the thing the system was built to give, and what follows when it is not.

Source: Chin H, Baek G, Cha C, Cha M. Chatbots' Empathetic Conversations and Responses: A Qualitative Study of Help-Seeking Queries on Depressive Moods Across 8 Commercial Conversational Agents. JMIR Formative Research 2025;9:e71538. A qualitative, formative study coding conversational style on largely pre-2021 data — it maps a mismatch between what users seek and what systems give, but measures no clinical outcome.
