Apr 22, 2026 · 1 min read · Research Papers

Study finds LLMs work best as clinical communication aids, not doctor replacements

A multidimensional evaluation by Barone, Di Serio, and Moio finds that LLMs can enhance physician-patient communication through collaborative rewriting, but no model surpasses physicians on epistemic criteria.

arXiv · Mariano Barone, Francesco Di Serio, Roberto Moio

Composite: 5.9 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Practitioners building clinical or healthcare-facing LLM applications should design their systems around collaborative rewriting workflows rather than direct generation: in this study, "rephrase" configurations outperformed baseline prompting on readability, semantic fidelity, and emotional tone.


Summary (our read of the original)

A paper by Mariano Barone, Francesco Di Serio, and Roberto Moio presents a multidimensional evaluation of LLMs in clinical communication settings, examining both general-purpose and domain-specialized models across structured medical explanations and real-world physician-patient interactions. The study measures three dimensions: semantic fidelity (how closely model outputs match physician responses), readability (measured as Flesch-Kincaid Grade Level, or FKGL), and affective resonance (emotional tone and polarity). Baseline models amplified affective polarity relative to physicians: the share of very negative sentiment ranged from 43.14% to 45.10% for models versus 37.25% for physicians. Larger architectures such as GPT-5 and Claude also produced substantially higher linguistic complexity, with FKGL scores of 16.91–17.60 compared to 11.47–12.50 for physician-authored responses.
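To make the readability numbers concrete, here is a minimal sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The function names `count_syllables` and `fkgl` are our own, and the syllable counter is a rough vowel-group heuristic, so this is an illustration rather than the tooling the authors used; absolute scores will differ from theirs.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of `text` using the standard coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    plain = "Take one pill each day. Call us if you feel sick."
    dense = (
        "Administration of the prescribed pharmacological intervention "
        "necessitates continuous monitoring of hepatic function."
    )
    print(f"plain: {fkgl(plain):.2f}")
    print(f"dense: {fkgl(dense):.2f}")
```

Short sentences and short words pull the grade level down, while long sentences packed with polysyllabic terms push it up. That is the mechanism behind the gap the study reports between model and physician responses.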

The study tested two intervention strategies: empathy-oriented prompting and collaborative rewriting. Empathy-oriented prompting reduced extreme negativity and lowered grade-level complexity by up to 6.87 FKGL points for GPT-5, but did not significantly increase semantic fidelity to physician responses. Collaborative rewriting, specifically the "rephrase" configurations, yielded the strongest overall alignment, reaching a mean semantic similarity of up to 0.93 with physician answers while consistently improving readability and reducing affective extremity. A dual stakeholder evaluation found that no model surpassed physicians on epistemic criteria, while patients consistently preferred rewritten model variants for their clarity and emotional tone. The authors conclude that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
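This summary does not say how the paper computes semantic similarity. A common approach is cosine similarity between vector representations of the two texts. The sketch below shows that idea with plain word-count vectors standing in for whatever embedding model the authors used. The helper names `bag_of_words` and `cosine_similarity` are hypothetical, and the scores are not comparable to the reported 0.93.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a real text embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of `a` and `b`."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no words on one side: treat as no similarity
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    physician = "Take the tablet with food twice a day"
    rephrase = "Please take the tablet twice a day with food"
    unrelated = "The weather is sunny"
    print(cosine_similarity(physician, rephrase))
    print(cosine_similarity(physician, unrelated))
```

A rephrase that keeps the physician's wording scores close to 1, while unrelated text scores near 0. That fits the finding that rewriting an existing physician answer stays closer to it than generating a response from scratch.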

Key facts

01 Baseline LLMs amplify negative affective polarity vs. physicians: 43.14–45.10% very negative sentiment vs. 37.25% for physicians.
02 Larger models like GPT-5 and Claude produce FKGL scores of 16.91–17.60, far above physician-authored responses at 11.47–12.50.
03 Empathy-oriented prompting reduces grade-level complexity by up to 6.87 FKGL points for GPT-5 but does not significantly improve semantic fidelity.
04 Collaborative rewriting ('rephrase' configurations) achieves the highest semantic similarity to physician answers, with a mean score up to 0.93.
05 No model surpasses physicians on epistemic criteria in dual stakeholder evaluation.
06 Patients consistently prefer rewritten model variants for clarity and emotional tone.
07 Authors conclude LLMs are best suited as collaborative communication enhancers, not replacements for clinical expertise.

Topics

#llm-evaluation #clinical-nlp #prompt-engineering #benchmarks #safety

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 23, 2026 · 11:04 UTC.
