LLMs show divergent consistency in exercise prescriptions at temperature=0
A study by Kihyuk Lee comparing GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash found that identical decoding settings produce fundamentally different consistency profiles in AI-generated exercise prescriptions, with GPT-4.1 generating unique outputs while Gemini relied heavily on text duplication.
Teams deploying LLMs in clinical or health-adjacent coding tools should test repeated generation behavior — not just single-output quality — since identical temperature settings can hide fundamentally different reliability profiles across models.
Kihyuk Lee's study examined how consistently three leading LLMs — GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash — generate exercise prescriptions under repeated conditions. Each model was prompted with six clinical scenarios 20 times at temperature=0, producing 360 total outputs evaluated across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. The study found statistically significant inter-model differences in semantic similarity (H = 458.41, p < .001), with GPT-4.1 scoring highest at 0.955, Gemini 2.5 Flash at 0.950, and Claude Sonnet 4.6 at 0.903.
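To make the design concrete, here is a minimal sketch of the output grid and of one way to score repeated outputs against each other. The summary does not say how semantic similarity was computed, so both the pairwise-averaging scheme and the `bow_cosine` helper are assumptions for illustration. `bow_cosine` is a purely lexical bag-of-words stand-in, not an embedding-based semantic measure.

```python
import math
from collections import Counter
from itertools import combinations

MODELS = ["GPT-4.1", "Claude Sonnet 4.6", "Gemini 2.5 Flash"]
N_SCENARIOS = 6
N_REPEATS = 20

# 3 models x 6 scenarios x 20 repeats = 360 outputs, as reported.
total_outputs = len(MODELS) * N_SCENARIOS * N_REPEATS

# Comparing every repeat with every other repeat of one scenario
# gives C(20, 2) unordered pairs per model/scenario cell.
pairs_per_cell = math.comb(N_REPEATS, 2)


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (lexical stand-in, assumed)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity over all unordered pairs of repeated outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    return sum(bow_cosine(a, b) for a, b in pairs) / len(pairs)
```

In practice the lexical helper would be swapped for a sentence-embedding similarity; the pairwise loop would stay the same.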
Crucially, the study reveals that similar semantic similarity scores can mask fundamentally different generative behaviors. Every GPT-4.1 output was unique (100%), yet its semantic content stayed stable, which the study reads as genuine reasoning consistency. Gemini 2.5 Flash, by contrast, repeated itself heavily: only 27.5% of its outputs were unique, so its high similarity score was driven by text duplication rather than by consistent underlying reasoning. Safety expression reached ceiling levels across all three models, which made it ineffective as a differentiating metric.
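The distinction between uniqueness and similarity can be sketched in a few lines. The example below uses made-up prescription strings, not study data, and assumes that "unique" means distinct as an exact string. It also assumes the 27.5% figure is pooled over all 6 x 20 = 120 outputs of one model, which would correspond to 33 distinct texts; the summary does not confirm either assumption.

```python
from itertools import combinations


def unique_ratio(outputs: list[str]) -> float:
    """Fraction of outputs that are distinct as exact strings."""
    return len(set(outputs)) / len(outputs)


def identical_pair_share(outputs: list[str]) -> float:
    """Fraction of unordered output pairs that are exact duplicates.

    Any similarity metric scores such pairs at the maximum, so a high
    share pushes the mean similarity up regardless of reasoning quality.
    """
    pairs = list(combinations(outputs, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical outputs: one model mostly repeats itself, the other rephrases.
base = "Walk 30 min, 5 days/week at moderate intensity."
duplicating = [base] * 15 + [f"{base} (variant {i})" for i in range(5)]
rephrasing = [f"Plan {i}: brisk walking, 30 min, 5 days weekly." for i in range(20)]

dup_unique = unique_ratio(duplicating)          # 6 distinct texts out of 20
dup_pairs = identical_pair_share(duplicating)   # 105 of 190 pairs identical
para_unique = unique_ratio(rephrasing)          # every text distinct
para_pairs = identical_pair_share(rephrasing)   # no identical pairs

# Assumed pooling: 27.5% of one model's 120 outputs.
outputs_per_model = 6 * 20
implied_unique_texts = round(0.275 * outputs_per_model)
```

Reporting both numbers side by side is what separates a model that restates the same plan in new words from one that returns the same text.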
The study concludes that identical decoding settings do not guarantee comparable output behavior across models, and that repeated generation testing should be a core evaluation criterion for clinical LLM deployments. Model selection in healthcare contexts, it argues, is a clinical decision with real implications for reliability and patient safety.
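A repeated-generation check of the kind recommended here could look like the sketch below. The `generate` callable, the `fake_generate` stub, and the example prompt are all hypothetical placeholders; a real harness would wrap whichever model client a team uses and pin temperature and other decoding settings there.

```python
from collections import Counter
from typing import Callable


def repeated_generation_report(
    generate: Callable[[str], str], prompt: str, n: int = 20
) -> dict:
    """Call `generate` n times on one prompt and summarize exact repetition."""
    outputs = [generate(prompt) for _ in range(n)]
    counts = Counter(outputs)
    top_text, top_count = counts.most_common(1)[0]
    return {
        "n": n,
        "distinct_outputs": len(counts),
        "unique_ratio": len(counts) / n,
        # Share of runs that returned the single most frequent text.
        "most_common_share": top_count / n,
        "most_common_text": top_text,
    }


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call (hypothetical); always returns the same text."""
    return "Walk 30 minutes, 5 days per week."


report = repeated_generation_report(fake_generate, "hypothetical scenario text")
```

Running the same report per model and per scenario gives a small table that can sit next to single-output quality scores in an evaluation.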
Key facts
- Study author: Kihyuk Lee; compared GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash on exercise prescription generation
- Each model generated outputs for 6 clinical scenarios 20 times at temperature=0, yielding 360 total outputs
- Mean semantic similarity scores: GPT-4.1 (0.955), Gemini 2.5 Flash (0.950), Claude Sonnet 4.6 (0.903); inter-model differences were statistically significant (H = 458.41, p < .001)
- GPT-4.1 produced 100% unique outputs while maintaining high semantic consistency
- Gemini 2.5 Flash had only 27.5% unique outputs, meaning its high similarity score derived from text duplication, not consistent reasoning
- Safety expression reached ceiling levels across all models, making it an ineffective differentiating metric
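An H statistic with a p-value is the form usually reported by a Kruskal-Wallis rank test, although this summary does not name the test, so treating it as such is an assumption. The sketch below computes the uncorrected H statistic from scratch on made-up similarity scores; the numbers are illustrative only and are not the study's data.

```python
def kruskal_wallis_h(groups: list[list[float]]) -> float:
    """Kruskal-Wallis H using average ranks for ties (no tie-correction factor)."""
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)

    # Map each distinct value to the mean of the 1-based ranks it occupies.
    rank_of: dict[float, float] = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # Sum over groups of (rank sum squared) / group size.
    weighted = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Made-up per-pair similarity scores for three hypothetical models.
scores = [
    [0.96, 0.95, 0.97],
    [0.94, 0.93, 0.92],
    [0.90, 0.89, 0.91],
]
h_stat = kruskal_wallis_h(scores)
```

With three groups the statistic is compared against a chi-squared distribution with 2 degrees of freedom. A library routine such as `scipy.stats.kruskal` additionally applies a tie correction and returns the p-value.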