Apr 21, 2026 · 1 min read · Research Papers

LLMs show divergent consistency in exercise prescriptions at temperature=0

A study by Kihyuk Lee comparing GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash found that identical decoding settings produce fundamentally different consistency profiles in AI-generated exercise prescriptions, with GPT-4.1 generating unique outputs while Gemini relied heavily on text duplication.

arXiv · Kihyuk Lee

Composite score: 6.5 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Teams deploying LLMs in clinical or health-adjacent coding tools should test repeated generation behavior — not just single-output quality — since identical temperature settings can hide fundamentally different reliability profiles across models.

Summary: our read of the original

Kihyuk Lee's study examined how consistently three leading LLMs (GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash) generate exercise prescriptions when given the same prompt repeatedly. Each model was prompted with six clinical scenarios 20 times at temperature=0, producing 360 total outputs (3 models × 6 scenarios × 20 runs) evaluated across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. The study found statistically significant inter-model differences in semantic similarity (H = 458.41, p < .001), with mean scores of 0.955 for GPT-4.1, 0.950 for Gemini 2.5 Flash, and 0.903 for Claude Sonnet 4.6.
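The design arithmetic and a repeated-run similarity score can be sketched as follows. This is a minimal illustration, not the study's pipeline: the summary does not say how semantic similarity was computed, so the bag-of-words cosine below (`cosine_bow`) is a crude stand-in for an embedding model, and averaging over all pairs of runs is an assumption. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity over word counts (assumed stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(outputs, sim=cosine_bow) -> float:
    """Average similarity over every unordered pair of repeated outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Design reported in the summary: 3 models, 6 scenarios, 20 runs each.
MODELS, SCENARIOS, RUNS = 3, 6, 20
total_outputs = MODELS * SCENARIOS * RUNS      # 360 outputs in total
pairs_per_cell = RUNS * (RUNS - 1) // 2        # 190 pairs per model-scenario cell
```

In practice `sim` would be swapped for a real sentence-embedding similarity; the aggregation logic stays the same.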

Crucially, the study reveals that similar semantic similarity scores can mask fundamentally different generative behaviors. GPT-4.1 produced a distinct output on every run (100% unique) while maintaining stable semantic content, indicating genuine reasoning consistency. Gemini 2.5 Flash, by contrast, showed pronounced output repetition with only 27.5% unique outputs, meaning its high similarity score was driven by text duplication rather than consistent underlying reasoning. Safety expression reached ceiling levels across all three models, rendering it ineffective as a differentiating metric.
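The uniqueness figure is simple to compute alongside a similarity score. The sketch below is one plausible definition, assuming duplicates are detected by exact string match after whitespace normalization; the summary does not state how the study counted unique outputs, and `unique_output_rate` is a hypothetical name. The sample strings are invented for illustration.

```python
def unique_output_rate(outputs) -> float:
    """Fraction of outputs that are distinct after collapsing whitespace."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    normalized = [" ".join(text.split()) for text in outputs]
    return len(set(normalized)) / len(normalized)


# Invented examples: same advice reworded on each run vs. copied verbatim.
reworded = [
    "Walk briskly for 30 minutes, 3 days per week.",
    "Do 30 minutes of brisk walking three times weekly.",
    "Three weekly sessions of 30-minute brisk walks.",
    "Brisk walking, 30 minutes per session, 3 sessions a week.",
]
duplicated = [
    "Walk briskly for 30 minutes, 3 days per week.",
    "Walk briskly for 30 minutes,  3 days per week.",  # differs only in spacing
    "Walk briskly for 30 minutes, 3 days per week.",
    "Cycle for 20 minutes, 3 days per week.",
]

reworded_rate = unique_output_rate(reworded)      # every run is distinct
duplicated_rate = unique_output_rate(duplicated)  # two distinct texts out of four
```

Reporting this rate next to mean similarity separates a model that rephrases the same plan from one that repeats the same text.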

The author concludes that identical decoding settings do not guarantee comparable output behavior across models, and that repeated generation testing should be a core evaluation criterion for clinical LLM deployments. Model selection in healthcare contexts, the study argues, is a clinical decision with real implications for reliability and patient safety.
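A repeated-generation check of the kind recommended here can be sketched in a few lines. This is an assumed harness, not the study's code: `generate` is a hypothetical callable wrapping whatever model client a team uses (called with temperature=0 on its side), and `stub_model` is a placeholder so the example runs without any API access.

```python
from typing import Callable, Dict, List


def run_repeated(generate: Callable[[str], str],
                 scenarios: List[str],
                 runs: int = 20) -> Dict[str, List[str]]:
    """Call the model `runs` times per scenario and keep every raw output."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return {scenario: [generate(scenario) for _ in range(runs)]
            for scenario in scenarios}


def distinct_counts(results: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of distinct texts observed per scenario."""
    return {scenario: len(set(outputs)) for scenario, outputs in results.items()}


def stub_model(prompt: str) -> str:
    """Placeholder model that always returns the same text for a prompt."""
    return f"Plan for {prompt}: walk 30 minutes, 3 days per week."


results = run_repeated(stub_model, ["scenario A", "scenario B"], runs=20)
counts = distinct_counts(results)  # the stub is fully repetitive: 1 per scenario
```

Keeping the raw outputs, rather than only a score, lets the same run feed similarity, uniqueness, and safety checks afterwards.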

Key facts

01 Study author: Kihyuk Lee; compared GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash on exercise prescription generation
02 Each model generated outputs for 6 clinical scenarios 20 times at temperature=0, yielding 360 total outputs
03 Mean semantic similarity scores: GPT-4.1 (0.955), Gemini 2.5 Flash (0.950), Claude Sonnet 4.6 (0.903); inter-model differences were statistically significant (H = 458.41, p < .001)
04 GPT-4.1 produced 100% unique outputs while maintaining high semantic consistency
05 Gemini 2.5 Flash had only 27.5% unique outputs, meaning its high similarity score derived from text duplication, not consistent reasoning
06 Safety expression reached ceiling levels across all models, making it an ineffective differentiating metric
07 The study concludes model selection for clinical LLM use is a clinical decision, and repeated generation testing should be a core deployment criterion
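The H statistic in fact 03 is the conventional notation for a Kruskal-Wallis rank test, although the summary does not name the test, so treating it as Kruskal-Wallis is an assumption. The sketch below shows how such a statistic is computed from per-model similarity scores, using the standard formula with average ranks and tie correction. The input values are toy numbers, not the study's data, and will not reproduce H = 458.41.

```python
from typing import Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H with average ranks for ties and tie correction.

    Assumes every group is non-empty and not all values are identical.
    """
    pooled = sorted((value, gi) for gi, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j
    h = (12 / (n * (n + 1))
         * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Toy scores for three hypothetical models, fully separated by rank.
toy_h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

With three groups the statistic is compared against a chi-squared distribution with 2 degrees of freedom to obtain the p-value.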

Topics

#benchmarks #model-evaluation #llm-consistency #safety #reasoning

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 11:07 UTC.
