Apr 21, 2026 · 1 min read · Research Papers

LLMs show divergent consistency in exercise prescriptions

A study by Kihyuk Lee comparing GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash found that identical temperature=0 settings produce fundamentally different consistency behaviors in AI-generated exercise prescriptions, with GPT-4.1 generating unique outputs each time while Gemini relied heavily on text duplication.

arXiv · Kihyuk Lee

Composite: 4.9 out of 10

Novelty · 25%
Impact · 43%
Credibility · 12%
Depth · 20%

Weights applied.

Why it matters

Practitioners deploying LLMs in clinical or health-adjacent coding systems should evaluate models under repeated-generation conditions — not just single outputs — to distinguish genuine reasoning consistency from text duplication before trusting model outputs in high-stakes workflows.

Summary: our read of the original

Kihyuk Lee's study evaluated how consistently three LLMs (GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash) generate exercise prescriptions under controlled conditions (temperature=0). Each model answered each of six clinical scenarios 20 times (3 models × 6 scenarios × 20 repetitions = 360 outputs), and the outputs were analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Inter-model differences in semantic similarity were statistically significant (H = 458.41, p < .001), with GPT-4.1 scoring highest (0.955), Gemini 2.5 Flash second (0.950), and Claude Sonnet 4.6 lowest (0.903).
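The summary reports an H statistic without naming the test. H is the symbol conventionally used for the Kruskal-Wallis rank test, so the sketch below assumes that test; it is an illustration of how such a statistic is computed, not the paper's analysis code. The function name and the toy scores are invented for this example.

```python
from itertools import chain


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction (assumed test, see lead-in)."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)

    # Assign each distinct value the mean of the 1-based ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                              # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    # H = 12 / (N(N+1)) * sum(R_k^2 / n_k) - 3(N+1), R_k = rank sum of group k.
    rank_sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)

    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Invented similarity scores for three hypothetical models, not study data.
toy_scores = [
    [0.95, 0.96, 0.97],
    [0.94, 0.93, 0.92],
    [0.90, 0.89, 0.91],
]
print(kruskal_wallis_h(toy_scores))
```

A larger H means the groups' rank distributions are further apart; the p-value would then come from a chi-squared distribution with (number of groups − 1) degrees of freedom.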

A key finding is that comparable semantic similarity scores can reflect entirely different underlying behaviors. GPT-4.1 produced 100% unique outputs while maintaining stable semantic content, suggesting genuine reasoning consistency. Gemini 2.5 Flash, by contrast, showed pronounced output repetition: only 27.5% of its outputs were unique, meaning its high similarity score was driven by text duplication rather than consistent reasoning. Safety expression reached ceiling levels across all three models, rendering it ineffective as a differentiating metric.
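To make the distinction concrete, here is a minimal sketch that reports a similarity score and a unique-output ratio side by side. Two assumptions are baked in: "unique" is taken to mean the share of textually distinct strings, and token-level Jaccard overlap stands in for the embedding-based semantic similarity a real evaluation would use. The example prescriptions are invented, not drawn from the study.

```python
from itertools import combinations


def unique_ratio(outputs):
    """Share of textually distinct outputs (exact string match; assumed definition)."""
    return len(set(outputs)) / len(outputs)


def jaccard(a, b):
    """Token-overlap similarity, a crude stand-in for an embedding-based score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def mean_pairwise_similarity(outputs):
    """Average similarity over every unordered pair of outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented toy outputs: one set rephrases each time, the other mostly repeats itself.
paraphrased = [
    "walk briskly 30 minutes five days per week",
    "five days per week walk briskly for 30 minutes",
    "walk briskly for 30 minutes on five days per week",
    "on five days per week walk briskly 30 minutes",
]
duplicated = ["walk briskly 30 minutes five days per week"] * 3 + [
    "walk briskly for 30 minutes five days per week"
]

for name, outputs in [("paraphrased", paraphrased), ("duplicated", duplicated)]:
    print(name, round(mean_pairwise_similarity(outputs), 3), unique_ratio(outputs))
```

Both toy sets score high on similarity, yet only the unique ratio reveals that one of them gets there by repeating the same string.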

The study concludes that repeated-generation testing should be treated as a core evaluation criterion for clinical LLM deployment, since single-output evaluations cannot surface these behavioral distinctions. The paper argues that model selection for LLM-based exercise prescription systems is a clinical decision with real implications for reliability and patient safety.
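A repeated-generation check of this kind could be wired up roughly as follows. This is a sketch under stated assumptions: `generate` is a hypothetical callable wrapping whatever model client is in use, the scenario labels are placeholders, and the default of 20 repetitions mirrors the study design described above.

```python
from collections import Counter
from typing import Callable, Dict, List


def repeated_generation_report(
    generate: Callable[[str], str],
    scenarios: List[str],
    repetitions: int = 20,
) -> Dict[str, dict]:
    """Call `generate` repeatedly per scenario and summarize output repetition."""
    report = {}
    for scenario in scenarios:
        outputs = [generate(scenario) for _ in range(repetitions)]
        counts = Counter(outputs)
        report[scenario] = {
            "n_outputs": len(outputs),
            "n_distinct": len(counts),
            # Share of runs that returned the single most frequent string.
            "most_common_share": counts.most_common(1)[0][1] / len(outputs),
        }
    return report


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for a real model call (hypothetical)."""
    return f"Prescription for {prompt}: walk 30 minutes daily"


report = repeated_generation_report(stub_model, ["scenario_1", "scenario_2"], repetitions=5)
for scenario, stats in report.items():
    print(scenario, stats)
```

Swapping `stub_model` for a real client call is the only change needed to run the same report against a deployed model; a similarity metric can then be computed over the collected outputs.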

Key facts

01 Study tested GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash at temperature=0, generating 360 total outputs across six clinical scenarios.
02 Mean semantic similarity scores: GPT-4.1 (0.955), Gemini 2.5 Flash (0.950), Claude Sonnet 4.6 (0.903); inter-model differences were statistically significant (H = 458.41, p < .001).
03 GPT-4.1 produced 100% unique outputs while maintaining high semantic similarity, indicating consistent reasoning.
04 Gemini 2.5 Flash had only 27.5% unique outputs, meaning its high similarity score was driven by text duplication, not consistent reasoning.
05 Safety expression reached ceiling levels across all three models, making it an ineffective differentiating metric.
06 Single-output evaluations cannot detect these behavioral differences between models.
07 The paper concludes that model selection for clinical LLM use is a clinical decision, not merely a technical one.

Topics

#benchmarks #model-release #reasoning #safety

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 19:13 UTC.
