When It Comes To Empathy, ChatGPT Is Acting More 'Human' Than Some Doctors

'Botside' manner may soon replace bedside manner. (Credit: Andrey_Popov on Shutterstock)

If AI’s empathy advantage extends to voice interactions, it could revolutionize patient care for millions.

In A Nutshell
  • AI chatbots like ChatGPT scored roughly 2 points higher than doctors on 10-point empathy scales
  • The advantage held across 13 of 15 studies examining cancer, mental health, thyroid conditions, and other medical questions
  • All studies evaluated text-only interactions; results may not apply to in-person or voice consultations
  • One in five UK doctors already uses ChatGPT for tasks like writing patient correspondence
  • Future research needs to test whether patients receiving actual care perceive the same empathy advantage

    Healthcare workers and patients feel more warmth from AI-generated medical responses than from actual doctors, a surprising analysis of 15 studies shows. The largest study examined 2,164 patient interactions, with similar patterns emerging across smaller datasets.

    ChatGPT and similar AI chatbots scored roughly two points higher than human healthcare professionals on 10-point empathy scales when responding to patient questions via text. AI had a 73% probability of being rated as more empathic than human practitioners in head-to-head comparisons.

    “In text-only scenarios, AI chatbots are frequently perceived as more empathic than human HCPs,” study authors wrote. The meta-analysis from the Universities of Nottingham and Leicester pooled data from 13 of the 15 studies comparing AI chatbots to doctors, nurses, and other healthcare workers.

    The results, published in the British Medical Bulletin, challenge long-held assumptions about human connection in medicine and run counter to a 2019 UK government report that called empathy an “essential human skill that AI cannot replicate.”

    AI Shows Empathy Edge Across Medical Specialties

    ChatGPT-4 outperformed human clinicians in nine separate studies spanning cancer care, thyroid conditions, mental health, autism, and general medical inquiries. For thyroid questions, the AI scored 1.42 standard deviations above human surgeons in empathy ratings. Mental health queries showed similar patterns, with ChatGPT-4 scoring 0.97 standard deviations higher than licensed mental health professionals.

    Patient complaints revealed the starkest gaps. When handling grievances across hospital departments, ChatGPT-4 scored 2.08 standard deviations higher than human patient relations officers.

    The AI advantage appeared consistent regardless of who evaluated the responses. When both physicians and patients reviewed the same set of answers about systemic lupus, ChatGPT-4 received higher empathy ratings from physicians. For questions about multiple sclerosis, patient representatives using a validated empathy scale rated AI responses more favorably than neurologist responses.

    Studies drawing from Reddit health forums and patient portals showed similar trends. Questions ranged from interpreting blood test results to managing chronic conditions to understanding cancer treatment options. Across this variety, AI responses were more likely to be rated as warm, understanding, and considerate of patient concerns.

    The analysis mostly focused on ChatGPT, but also included additional LLMs like Gemini Pro and Claude. (Photo by Tada Images on Shutterstock)

    Dermatology provided the sole exception. In both studies examining skin-related questions, dermatologists outperformed ChatGPT-3.5 and Med-PaLM 2, though researchers couldn’t explain this specialty-specific pattern.

    The Text Message Caveat

    All studies evaluated text-based interactions exclusively. Even when one study converted AI responses to audio, empathy ratings came from written transcripts alone.

    A doctor’s nod, forward lean, or eye contact often conveys understanding as powerfully as words. Text-based healthcare interactions represent a small portion of patient care, though their use grows with patient portals and telemedicine.

    Studies also relied on proxy evaluators rather than patients receiving actual care. Healthcare professionals, medical students, patient representatives, and researchers rated empathy in responses to real patient questions. Direct patient feedback might differ, particularly since healthcare providers and patients often rate empathy differently.

    Most studies used custom, unvalidated empathy scales. Raters typically scored responses on 1-5 or 1-10 scales ranging from “not empathetic” to “very empathetic.” Only one study employed the CARE scale, a validated 10-item instrument designed specifically for measuring therapeutic empathy in clinical consultations.

    The studies couldn’t determine whether AI’s perceived empathy advantage translates to better health outcomes. While empathic communication has been linked to reduced patient pain and anxiety, improved medication adherence, and higher satisfaction with care, these studies measured perception rather than clinical impact.

    Twenty Percent of UK Doctors Already Use ChatGPT

    The research lands as AI adoption in healthcare accelerates. One in five UK general practitioners now uses generative AI tools for tasks like writing patient correspondence. Over 117,000 patients across 31 NHS mental health services have interacted with Wysa, an AI-powered digital therapist, according to Wysa’s website.

    Study authors propose a collaborative model where doctors draft initial responses while AI enhances tone and empathic language, with clinicians ensuring medical accuracy. This approach could reduce physician workload while potentially improving patient satisfaction.

    Empathic delivery means little if medical advice proves wrong. AI reliability concerns persist, and gains in perceived warmth could vanish if responses contain factual errors or incomplete guidance.

    How the Research Was Conducted

    Researchers searched seven databases for studies published through November 2024, identifying 15 qualifying studies from 2023-2024. Most used unvalidated single-item scales asking raters to score empathy from 1-5 or 1-10. Only one employed a validated instrument, the CARE scale designed for measuring therapeutic empathy.

    Many patients complain their human doctors lack empathy. (Photo by Krakenimages.com on Shutterstock)

    Fourteen of the 15 studies assessed ChatGPT variants (versions 3.5 or 4), while some also examined Claude, Gemini Pro, Le Chat, ERNIE Bot, and Med-PaLM 2. Patient questions came from emails in private medical records, Reddit and public forums, real-time chat transcripts, and in-person reception interactions. The largest dataset included 2,164 live outpatient queries at a Chinese hospital.

    Nine studies had moderate risk of bias; six showed serious risk. Common problems included curated patient queries potentially skewing results, reliance on Reddit communities where users may face barriers to formal care, and supervised AI designs where human experts reviewed outputs before release.

    Telephone consultations account for 26% of general practitioner appointments in the UK. Emerging voice-enabled AI systems like ChatGPT’s Advanced Voice Mode are marketed with claims about responding with emotion and picking up on non-verbal cues, but no studies have tested these capabilities against human practitioners in spoken interactions. The research team says voice-based head-to-head tests are still needed. If AI’s empathy advantage extends to voice, it could reshape how millions of patients receive care.

    Paper Summary

    Methodology

    The systematic review followed PRISMA 2020 guidelines and searched seven databases (PubMed, Cochrane Library, Embase, PsycINFO, CINAHL, Scopus, IEEE Xplore) from inception through November 11, 2024. Researchers included studies that empirically compared empathy between AI chatbots using large language models and human healthcare professionals. Eligible studies involved real patients, healthcare users, or authentic patient-generated data such as emails, portal messages, or public forum posts. The team excluded hypothetical patient scenarios, rule-based AI systems, and interactions outside healthcare contexts. Two reviewers independently screened titles, abstracts, and full texts, with discrepancies resolved through discussion. Data extraction covered study design, participants, settings, AI interventions, human comparators, empathy measures, and key findings.

    Results

    Fifteen studies published in 2023-2024 met inclusion criteria. Thirteen studies provided data suitable for meta-analysis. The pooled analysis showed AI chatbots (specifically ChatGPT-3.5 and ChatGPT-4) demonstrated significantly higher empathy ratings than human practitioners, with a standardized mean difference of 0.87 (95% CI: 0.54-1.20, P<0.00001). Thirteen studies reported statistically significant advantages for AI systems, while two dermatology studies favored human responses. ChatGPT-4 showed more consistent results than ChatGPT-3.5, though statistical analysis found no significant difference between the two versions. All studies evaluated text-based interactions, with empathy assessed by proxy raters including healthcare professionals, medical students, patient representatives, and researchers using blinded evaluations.
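
    To see how a pooled effect of 0.87 connects to the 73% figure quoted earlier, the sketch below runs a simple DerSimonian-Laird random-effects pooling and then converts a standardized mean difference into a probability of superiority (the common-language effect size, Phi(d/sqrt(2))). The per-study variances and two of the effect sizes are invented for illustration, and the paper's exact meta-analytic model is not specified here, so this is a minimal sketch rather than a reproduction of the authors' analysis.

        import math

        # Illustrative inputs: the first three effect sizes echo the per-study SMDs
        # quoted in the article (thyroid, mental health, patient complaints); the
        # remaining effects and ALL variances are invented for demonstration. The
        # review itself reports only the pooled result (SMD 0.87, 95% CI 0.54-1.20).
        effects   = [1.42, 0.97, 2.08, 0.30, -0.40]
        variances = [0.20, 0.15, 0.30, 0.10, 0.12]

        # DerSimonian-Laird random-effects pooling.
        w_fixed = [1.0 / v for v in variances]
        fixed_mean = sum(w * y for w, y in zip(w_fixed, effects)) / sum(w_fixed)
        q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(w_fixed, effects))
        c = sum(w_fixed) - sum(w * w for w in w_fixed) / sum(w_fixed)
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)   # between-study variance
        w_rand = [1.0 / (v + tau2) for v in variances]
        pooled = sum(w * y for w, y in zip(w_rand, effects)) / sum(w_rand)
        se = math.sqrt(1.0 / sum(w_rand))

        # Common-language effect size: probability that a randomly chosen AI response
        # is rated more empathic than a randomly chosen human one, assuming normally
        # distributed ratings: P = Phi(d / sqrt(2)) = 0.5 * (1 + erf(d / 2)).
        def prob_ai_rated_higher(smd):
            return 0.5 * (1.0 + math.erf(smd / 2.0))

        print(f"illustrative pooled SMD: {pooled:.2f} "
              f"(95% CI {pooled - 1.96 * se:.2f} to {pooled + 1.96 * se:.2f})")
        print(f"P(AI rated higher) for the reported SMD of 0.87: "
              f"{prob_ai_rated_higher(0.87):.2f}")   # roughly 0.73

    The conversion line is the key step: under a normality assumption, a standardized mean difference of 0.87 corresponds to roughly a 73% chance that a randomly selected AI response outranks a randomly selected human response on the same empathy scale.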

    Limitations

    The study had several important limitations. All interactions were text-based, excluding non-verbal communication cues like body language and tone that typically contribute to empathy in healthcare consultations. Empathy evaluations came from proxy raters rather than patients directly receiving care, and research shows these groups often rate empathy differently. Only two of 15 studies involved healthcare professionals replying to their own patients with access to medical records and prior care context; remaining studies assessed one-off interactions, often from public forums like Reddit. Most studies used unvalidated empathy measures such as custom single-item Likert scales rather than validated instruments. Study populations were predominantly from Western countries, limiting generalizability. Six studies showed serious risk of bias, with common issues including curated patient queries, reliance on Reddit communities with potentially unrepresentative users, and supervised AI designs where human experts reviewed outputs. Fourteen of 15 studies focused on ChatGPT variants, limiting insight into other AI systems used in clinical practice.

    Funding and Disclosures

    The research received no specific funding from agencies in the public, commercial, or not-for-profit sectors. All authors declared no competing financial interests or personal relationships that could have influenced the work.

    Publication Details

    Howcroft A, Bennett-Weston A, Khan A, Griffiths J, Gay S, Howick J. “AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care,” published in the British Medical Bulletin on October 20, 2025;156:1-13. doi:10.1093/bmb/ldaf017