Study finds LLMs work best as clinical communication aids, not doctor replacements
A multidimensional evaluation by Barone, Di Serio, and Moio finds that LLMs can enhance physician-patient communication through collaborative rewriting, but no model surpasses physicians on epistemic criteria.
Practitioners building clinical or healthcare-facing LLM applications should design systems around collaborative rewriting workflows rather than direct generation: in the study, rephrase configurations outperformed baseline prompting on readability, semantic fidelity, and emotional tone.
A paper by Mariano Barone, Francesco Di Serio, and Roberto Moio presents a multidimensional evaluation of LLMs in clinical communication settings, examining both general-purpose and domain-specialized models across structured medical explanations and real-world physician-patient interactions. The study measures three dimensions: semantic fidelity (how closely model outputs match physician responses), readability (using the Flesch-Kincaid Grade Level, or FKGL, scale), and affective resonance (emotional tone and polarity). Baseline models amplified affective polarity relative to physicians: very negative sentiment ranged from 43.14% to 45.10% in model outputs versus 37.25% for physicians. Larger architectures such as GPT-5 and Claude also produced substantially higher linguistic complexity, with FKGL scores of 16.91–17.60 compared to 11.47–12.50 in physician-authored responses.
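For readers unfamiliar with the readability metric, the standard Flesch-Kincaid Grade Level formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is a minimal illustration of that formula, not the paper's tooling: the sentence splitter and the vowel-group syllable counter are rough heuristics assumed here, and the function names are hypothetical, so scores will differ somewhat from those of a dedicated readability library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of `text`, using the standard published coefficients."""
    # Naive sentence split on terminal punctuation; abbreviations will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    plain = "Your heart needs help now."
    dense = "Myocardial infarction necessitates immediate pharmacological intervention."
    # The jargon-heavy sentence should score several grade levels above the plain one.
    print(round(fkgl(plain), 2), round(fkgl(dense), 2))
```

A score around 12 corresponds roughly to the end of secondary school, which is why the 16.91–17.60 range reported for larger models indicates text pitched well above what physicians themselves wrote.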
The study tested two intervention strategies: empathy-oriented prompting and collaborative rewriting. Empathy-oriented prompting reduced extreme negativity and lowered grade-level complexity by up to 6.87 FKGL points for GPT-5, but did not significantly increase semantic fidelity to physician responses. Collaborative rewriting, specifically the "rephrase" configurations, yielded the strongest overall alignment, reaching a mean semantic similarity of up to 0.93 with physician answers while consistently improving readability and reducing affective extremity. A dual stakeholder evaluation revealed that no model surpassed physicians on epistemic criteria, while patients consistently preferred rewritten model variants for their clarity and emotional tone. The authors conclude that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
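This summary does not say how the semantic similarity score was computed. One common approach, assumed here purely for illustration, is to embed each model answer and the matching physician answer as vectors and average their cosine similarity. The sketch below takes precomputed vectors as input; the function names are hypothetical, and the embedding model that would produce the vectors is left out.

```python
import math
from statistics import mean
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_similarity(
    model_vecs: Sequence[Sequence[float]],
    physician_vecs: Sequence[Sequence[float]],
) -> float:
    """Average pairwise similarity between model answers and physician answers.

    Assumes (hypothetically) that item i in each list answers the same question.
    """
    if len(model_vecs) != len(physician_vecs):
        raise ValueError("need one physician vector per model vector")
    return mean(cosine_similarity(m, p) for m, p in zip(model_vecs, physician_vecs))
```

Under this assumption, a mean near 0.93 would indicate that rewritten answers stay very close in meaning to the physician originals, while the readability and tone gains come from the surface wording.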
Key facts
- 01 Baseline LLMs amplify negative affective polarity vs. physicians: 43.14–45.10% very negative sentiment vs. 37.25% for physicians.
- 02 Larger models like GPT-5 and Claude produce FKGL scores of 16.91–17.60, far above physician-authored responses at 11.47–12.50.
- 03 Empathy-oriented prompting reduces grade-level complexity by up to 6.87 FKGL points for GPT-5 but does not significantly improve semantic fidelity.
- 04 Collaborative rewriting ("rephrase" configurations) achieves the highest semantic similarity to physician answers, with a mean score up to 0.93.
- 05 No model surpasses physicians on epistemic criteria in dual stakeholder evaluation.