LLM survey rating predictions hit a hard ceiling set by input text
A study by Andrew Hong, Jason Potteiger, and Luis E. Zapata finds that prompt design and model selection have far less impact on LLM-predicted experience ratings than the linguistic content of the survey text itself.
Practitioners using LLMs to extract structured signals from open-ended text should invest in understanding input data quality first — prompt tuning and model upgrades offer only marginal, bounded gains when the key information is absent from the source text.
Andrew Hong, Jason Potteiger, and Luis E. Zapata build on an earlier paper (Hong, Potteiger, and Zapata 2026) that established a baseline: an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This follow-up paper isolates the relative contribution of two common engineering levers, prompt customization and model selection, by testing four configurations drawn from two prompts (the original baseline and a moderately customized version) and three GPT models (4.1, 4.1-mini, and 5.2). All configurations were evaluated on approximately 10,000 post-game surveys from five MLB teams.
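The headline metric, agreement within one point, is simple to compute. The following is a minimal sketch and not the authors' code: the function name `within_one_agreement` is hypothetical, ratings are assumed to be integers, and the toy values are invented for illustration.

```python
from typing import Sequence


def within_one_agreement(
    predicted: Sequence[int], actual: Sequence[int], tolerance: int = 1
) -> float:
    """Share of predictions that land within `tolerance` points of the reported rating."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one rating pair")
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Toy ratings invented for illustration only; they are not the study's data.
predicted = [8, 6, 9, 4, 7, 10]
actual = [9, 9, 9, 5, 3, 10]

agreement = within_one_agreement(predicted, actual)
print(f"within ±1 agreement: {agreement:.0%}")  # 4 of 6 pairs agree
```

A prediction that is off by exactly one point counts as a hit, so the metric rewards being close rather than being exact.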
The results show that prompt customization added roughly two percentage points of within-±1 agreement on GPT 4.1 (from 67% to 69%). Model swaps from that best configuration both degraded performance: GPT 5.2 returned to the 67% baseline, and GPT 4.1-mini fell six percentage points below it. Most strikingly, across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the input text than by the choice of prompt or model.
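One way to make this kind of comparison concrete is to compute the same agreement metric twice, grouping once by configuration and once by a category of the input text, and then compare the gap between the best and worst group in each view. The sketch below assumes that setup; the `Row` fields, the labels `config_a`/`config_b` and `group_x`/`group_y`, and every rating are hypothetical placeholders, not the study's data or its text categories.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class Row(NamedTuple):
    config: str      # hypothetical label for a prompt/model pairing
    text_group: str  # hypothetical label for a linguistic category of the response
    predicted: int
    actual: int


def agreement_by(rows: Iterable[Row], key: Callable[[Row], str]) -> dict[str, float]:
    """Within-±1 agreement for each group defined by `key`."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        hits[group] += abs(row.predicted - row.actual) <= 1
    return {group: hits[group] / totals[group] for group in totals}


def spread(scores: dict[str, float]) -> float:
    """Gap between the best and worst group, as a fraction."""
    return max(scores.values()) - min(scores.values())


# Toy rows invented for illustration only; they are not the study's data.
rows = [
    Row("config_a", "group_x", 9, 9),
    Row("config_a", "group_x", 8, 9),
    Row("config_a", "group_y", 8, 4),
    Row("config_a", "group_y", 7, 6),
    Row("config_b", "group_x", 9, 9),
    Row("config_b", "group_x", 9, 9),
    Row("config_b", "group_y", 8, 4),
    Row("config_b", "group_y", 8, 6),
]

by_config = agreement_by(rows, key=lambda r: r.config)
by_text = agreement_by(rows, key=lambda r: r.text_group)

print(f"spread across configurations: {spread(by_config):.0%}")
print(f"spread across text groups:    {spread(by_text):.0%}")
```

In the toy rows the text grouping separates accuracy far more than the configuration grouping does, which is the shape of the result the paper reports; the specific magnitudes here carry no meaning.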
The authors frame the performance ceiling as having two distinct components. The first is a correctable model reading bias, which prompt customization can address. The second is a fundamental information gap: fans write about different things than what actually drives their ratings, and no amount of prompt or model engineering can recover information that was never in the text. The paper's conclusion is precise — prompt engineering helps in a specific, predictable way on the portion of the ceiling it can reach, but the dominant constraint is the signal quality of the input itself.
Key facts
- A baseline GPT 4.1 prompt predicted fan experience ratings within ±1 point 67% of the time, as established in a prior paper by the same authors.
- A moderately customized prompt improved within-±1 agreement by roughly two percentage points (67% to 69%) on GPT 4.1.
- GPT 5.2, swapped in for the best configuration, returned performance to the 67% baseline.
- GPT 4.1-mini fell six percentage points below the baseline when substituted.
- Accuracy varied more than an order of magnitude more by the linguistic character of the input text than by prompt or model choice.
- The performance ceiling has two parts: a correctable model reading bias and an uncloseable information gap where fans' written responses don't capture what drives their actual ratings.