LLMs show divergent consistency in exercise prescriptions
A study by Kihyuk Lee comparing GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash found that identical temperature=0 settings produce fundamentally different consistency behaviors in AI-generated exercise prescriptions, with GPT-4.1 generating unique outputs each time while Gemini relied heavily on text duplication.
Practitioners deploying LLMs in clinical or health-adjacent coding systems should evaluate models under repeated-generation conditions — not just single outputs — to distinguish genuine reasoning consistency from text duplication before trusting model outputs in high-stakes workflows.
Kihyuk Lee's study evaluated how consistently three LLMs — GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash — generate exercise prescriptions under controlled conditions (temperature=0). Each model produced prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Inter-model differences in semantic similarity were statistically significant (H = 458.41, p < .001), with GPT-4.1 scoring highest (0.955), Gemini 2.5 Flash second (0.950), and Claude Sonnet 4.6 lowest (0.903).
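The total of 360 outputs follows directly from the design: three models, six scenarios, and 20 repetitions each. A minimal Python sketch of that grid is below. The scenario labels are placeholders (the six scenarios are not named here), and the pairwise-comparison count is an assumption about how similarity might be computed within each model/scenario cell, not a detail confirmed by the study.

```python
from itertools import product
from math import comb

MODELS = ["GPT-4.1", "Claude Sonnet 4.6", "Gemini 2.5 Flash"]
# Placeholder labels: the six actual clinical scenarios are not listed in this summary.
SCENARIOS = [f"scenario_{i}" for i in range(1, 7)]
RUNS_PER_SCENARIO = 20

# One entry per (model, scenario, repetition) combination.
design = list(product(MODELS, SCENARIOS, range(RUNS_PER_SCENARIO)))

total_outputs = len(design)                       # 3 * 6 * 20
outputs_per_model = total_outputs // len(MODELS)  # 6 * 20

# Assumption: if similarity were computed for every pair of repetitions
# inside one model/scenario cell, each cell would contribute C(20, 2) pairs.
pairs_per_cell = comb(RUNS_PER_SCENARIO, 2)

print(total_outputs, outputs_per_model, pairs_per_cell)  # 360 120 190
```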
A key finding is that similar semantic similarity scores can reflect entirely different underlying behaviors. GPT-4.1 produced 100% unique outputs while maintaining stable semantic content, suggesting genuine reasoning consistency. Gemini 2.5 Flash, by contrast, showed pronounced output repetition — only 27.5% of its outputs were unique — meaning its high similarity score was driven by text duplication rather than consistent reasoning. Safety expression reached ceiling levels across all three models, rendering it ineffective as a differentiating metric.
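To see why a similarity score alone cannot separate these two behaviors, the sketch below computes both an exact-match uniqueness ratio and a mean pairwise similarity for two invented sets of outputs. It is illustrative only: the toy prescriptions are made up, and the bag-of-words cosine is a crude stand-in for whatever semantic similarity measure the study used.

```python
import math
from collections import Counter
from itertools import combinations

def unique_ratio(outputs):
    """Fraction of outputs that are distinct strings (exact match)."""
    return len(set(outputs)) / len(outputs)

def cosine_bow(a, b):
    """Cosine similarity over word counts (a stand-in for embedding similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_sq = sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())
    return dot / math.sqrt(norm_sq) if norm_sq else 0.0

def mean_pairwise_similarity(outputs):
    """Average similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(cosine_bow(a, b) for a, b in pairs) / len(pairs)

# Invented toy data, not outputs from any of the models in the study.
copied = ["walk 30 minutes at moderate intensity 3 days per week"] * 4
reworded = [
    "walk 30 minutes at moderate intensity 3 days per week",
    "3 days per week walk at moderate intensity for 30 minutes",
    "at moderate intensity walk 30 minutes on 3 days per week",
    "walk for 30 minutes 3 days per week at moderate intensity",
]

for name, outputs in [("copied", copied), ("reworded", reworded)]:
    print(name, unique_ratio(outputs), round(mean_pairwise_similarity(outputs), 3))
```

Both sets score high on similarity, but only the uniqueness ratio reveals that one of them is the same text repeated. Because the word-count cosine ignores word order, it overstates agreement compared with an embedding-based measure; it is used here only to keep the example dependency-free.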
The study concludes that repeated-generation testing should be treated as a core evaluation criterion for clinical LLM deployment, since single-output evaluations cannot surface these behavioral distinctions. It also argues that model selection for LLM-based exercise prescription systems is a clinical decision with real implications for reliability and patient safety.
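A repeated-generation check of this kind can be sketched in a few lines. The harness below is a hypothetical illustration, not the study's protocol: `generate` is any callable that maps a prompt to text, and `make_stub_model` is an invented stand-in so the example runs without a real model client.

```python
from collections import Counter
from itertools import cycle

def repeated_generation_report(generate, prompt, n_runs=20):
    """Call generate(prompt) n_runs times and summarise exact-text repetition."""
    outputs = [generate(prompt) for _ in range(n_runs)]
    counts = Counter(outputs)
    _, top_count = counts.most_common(1)[0]
    return {
        "n_runs": n_runs,
        "n_unique": len(counts),
        "unique_ratio": len(counts) / n_runs,
        # Share of runs taken up by the single most frequent output.
        "max_duplicate_share": top_count / n_runs,
        "outputs": outputs,
    }

def make_stub_model(responses):
    """Hypothetical stand-in for a model client: cycles through fixed responses."""
    it = cycle(responses)
    return lambda prompt: next(it)

stub = make_stub_model(["Plan A", "Plan A", "Plan B", "Plan A"])
report = repeated_generation_report(stub, "hypothetical scenario prompt", n_runs=20)
print(report["n_unique"], report["unique_ratio"], report["max_duplicate_share"])
```

In practice the stub would be replaced by a call to the model under evaluation, and the uniqueness summary would be read alongside a semantic similarity measure rather than in place of it.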
Key facts
- Study tested GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash at temperature=0, generating 360 total outputs across six clinical scenarios.
- Mean semantic similarity scores: GPT-4.1 (0.955), Gemini 2.5 Flash (0.950), Claude Sonnet 4.6 (0.903); inter-model differences were statistically significant (H = 458.41, p < .001).
- GPT-4.1 produced 100% unique outputs while maintaining high semantic similarity, indicating consistent reasoning.
- Gemini 2.5 Flash had only 27.5% unique outputs, meaning its high similarity score was driven by text duplication, not consistent reasoning.
- Safety expression reached ceiling levels across all three models, making it an ineffective differentiating metric.
- Single-output evaluations cannot detect these behavioral differences between models.