LLM survey rating predictions hit a hard ceiling, study finds
A study by Andrew Hong, Jason Potteiger, and Luis E. Zapata finds that prompt engineering and model selection have limited impact on the accuracy of LLM-predicted experience ratings from open-ended survey text. The character of the input text affects accuracy far more than either lever.
Practitioners building LLM pipelines to extract structured signals from unstructured survey or feedback text should focus optimization effort on input quality and data collection design rather than on prompt tuning or model upgrades. Information that respondents never wrote down sets a hard ceiling that no engineering can overcome.
Building on an earlier paper by the same authors, this study systematically tests whether prompt engineering or model selection can meaningfully push past the performance ceiling of LLM-predicted experience ratings. The experiment compared four configurations built from two prompts (a baseline and a moderately customized version) and three GPT models (`GPT 4.1`, `GPT 4.1-mini`, and `GPT 5.2`), evaluated on approximately 10,000 post-game surveys from five MLB teams. The baseline `GPT 4.1` prompt predicted fan-reported ratings within one point 67% of the time. Prompt customization added roughly two percentage points on `GPT 4.1`, reaching 69%. Switching to `GPT 5.2` returned performance to the baseline level, and `GPT 4.1-mini` fell six percentage points below it.
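The within-one-point figure quoted above can be illustrated with a minimal sketch. It assumes integer ratings and a tolerance of one point. The function name `within_one_agreement` and the toy data are hypothetical and are not taken from the study, whose rating scale and exact scoring code are not described here.

```python
from typing import Sequence


def within_one_agreement(
    predicted: Sequence[int],
    reported: Sequence[int],
    tolerance: int = 1,
) -> float:
    """Share of surveys whose predicted rating is within `tolerance` points
    of the rating the respondent actually reported."""
    if len(predicted) != len(reported):
        raise ValueError("predicted and reported must have the same length")
    if not predicted:
        raise ValueError("need at least one rating pair")
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reported))
    return hits / len(predicted)


# Hypothetical toy data (assumed 1-5 scale), not results from the study.
reported = [5, 4, 2, 5, 1, 3]
predicted = [4, 4, 4, 5, 3, 3]
print(f"{within_one_agreement(predicted, reported):.0%}")  # prints 67%
```

On this toy data, four of the six predictions land within one point of the reported rating, so the agreement is 4/6. The same calculation, applied to each configuration's predictions, would yield figures comparable to the 67% and 69% reported above.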
The central finding is that both engineering levers combined were dwarfed by the input itself: the variation in accuracy associated with the linguistic character of the text was more than an order of magnitude larger than the variation associated with prompt or model choice. The authors decompose the performance ceiling into two distinct parts. The first is a correctable bias in how the model interprets text, which prompt customization can address. The second is a structural gap between what fans choose to write and what actually drives their ratings. No amount of prompt or model engineering can close this gap, because the relevant information is simply absent from the text. The paper's conclusion is precise: prompt engineering helps in a specific, predictable way on the portion of the ceiling it can reach, but model selection moved neither part of the ceiling reliably.
Key facts
- 01 A baseline, unoptimized GPT 4.1 prompt predicted fan experience ratings within ±1 point 67% of the time from open-ended survey text.
- 02 The study tested ~10,000 post-game surveys from five MLB teams across four model/prompt configurations.
- 03 Prompt customization improved within-±1 agreement by roughly two percentage points on GPT 4.1 (67% → 69%).
- 04 Switching to GPT 5.2 erased the prompt customization gain, returning to baseline performance.
- 05 GPT 4.1-mini fell six percentage points below the baseline configuration.
- 06 The variation in accuracy associated with the linguistic character of the input text was more than an order of magnitude larger than the variation associated with prompt or model choice.
- 07 The performance ceiling has two parts: a correctable model-reading bias (addressable by prompt design) and an irreducible gap from information not present in the text.