Apr 21, 2026 · 1 min read · Research Papers

LLM survey rating predictions hit a hard ceiling set by input text

A study by Andrew Hong, Jason Potteiger, and Luis E. Zapata finds that prompt design and model selection have far less impact on LLM-predicted experience ratings than the linguistic content of the survey text itself.

ArXiv · Andrew Hong, Jason Potteiger, Luis E. Zapata

Composite score: 6.7 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Practitioners using LLMs to extract structured signals from open-ended text should assess input data quality first: prompt tuning and model upgrades offer only marginal, bounded gains when the key information is absent from the source text.

Summary: our read of the original

Andrew Hong, Jason Potteiger, and Luis E. Zapata build on an earlier paper (Hong, Potteiger, and Zapata 2026) that established a baseline: an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This follow-up paper isolates the relative contribution of two common engineering levers, prompt customization and model selection. It tests four configurations that combine two prompts (the original baseline and a moderately customized version) with three GPT models (4.1, 4.1-mini, and 5.2), evaluated on approximately 10,000 post-game surveys from five MLB teams.
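
The within-±1 agreement figure quoted above can be read as the share of surveys where the predicted rating lands within one point of the rating the fan reported. The sketch below is one plausible way to compute it, not the authors' code: the function name `within_one_agreement`, the assumption that ratings are integers, and the toy ratings are all illustrative and do not come from the study.

```python
from typing import Sequence


def within_one_agreement(predicted: Sequence[int], reported: Sequence[int]) -> float:
    """Share of rating pairs where the prediction is within 1 point of the reported rating."""
    if len(predicted) != len(reported):
        raise ValueError("predicted and reported must have the same length")
    if not predicted:
        raise ValueError("need at least one rating pair")
    # A pair counts as a hit when the absolute difference is 0 or 1.
    hits = sum(1 for p, r in zip(predicted, reported) if abs(p - r) <= 1)
    return hits / len(predicted)


# Toy ratings invented for illustration only (not data from the study).
toy_predicted = [8, 6, 9, 3, 10]
toy_reported = [9, 6, 5, 4, 10]
print(f"{within_one_agreement(toy_predicted, toy_reported):.0%}")  # prints "80%"
```

Under this reading, the 67% and 69% figures would be the same calculation run over the full survey set for each prompt and model configuration.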

The results show that prompt customization added roughly two percentage points of within-±1 agreement on GPT 4.1 (from 67% to 69%). Swapping models from that best configuration degraded performance in both cases: GPT 5.2 returned to the 67% baseline, and GPT 4.1-mini fell six percentage points below the baseline. Most strikingly, across capable configurations, the variation in accuracy associated with the linguistic character of the input text was more than an order of magnitude larger than the variation associated with the choice of prompt or model.
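
One way to make this kind of comparison concrete is to group per-survey hits by some attribute and compare the gap between the best and worst group. The sketch below is a hedged illustration, not the paper's method: `accuracy_by_group`, `spread`, the "specific" and "vague" text categories, and every row of data are hypothetical, chosen only to show the mechanics.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def accuracy_by_group(rows: Iterable[Tuple[Hashable, bool]]) -> Dict[Hashable, float]:
    """Within-±1 hit rate per group, given (group_label, hit) pairs."""
    hits: Dict[Hashable, int] = defaultdict(int)
    totals: Dict[Hashable, int] = defaultdict(int)
    for group, hit in rows:
        totals[group] += 1
        hits[group] += int(hit)
    return {group: hits[group] / totals[group] for group in totals}


def spread(accuracies: Dict[Hashable, float]) -> float:
    """Gap between the best and worst group accuracy."""
    return max(accuracies.values()) - min(accuracies.values())


# Hypothetical rows: (text category, was the prediction within 1 point?).
# Both the categories and the outcomes are invented for illustration.
toy_rows = [
    ("specific", True), ("specific", True), ("specific", True), ("specific", False),
    ("vague", True), ("vague", False), ("vague", False), ("vague", False),
]
by_text = accuracy_by_group(toy_rows)
print(by_text, spread(by_text))
```

Running the same two functions with configuration labels instead of text categories would give the spread attributable to prompt and model choice, which is the quantity the summary contrasts against.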

The authors frame the performance ceiling as having two distinct components. The first is a correctable model reading bias, which prompt customization can address. The second is a fundamental information gap: fans write about different things than what actually drives their ratings, and no amount of prompt or model engineering can recover information that was never in the text. The paper's conclusion is precise: prompt engineering helps in a specific, predictable way on the portion of the ceiling it can reach, but the dominant constraint is the signal quality of the input itself.

Key facts

01 Baseline GPT 4.1 prompt predicted fan experience ratings within ±1 point 67% of the time, established in a prior paper by the same authors.
02 A moderately customized prompt improved within-±1 agreement by roughly two percentage points (67% to 69%) on GPT 4.1.
03 GPT 5.2 swapped in for the best configuration returned performance to the 67% baseline.
04 GPT 4.1-mini fell six percentage points below the baseline when substituted.
05 Accuracy varied more than an order of magnitude more by the linguistic character of input text than by prompt or model choice.
06 The performance ceiling has two parts: a correctable model reading bias and an uncloseable information gap where fans' written responses don't capture what drives their actual ratings.
07 The study evaluated approximately 10,000 post-game surveys from five MLB teams across four configurations.

Topics

#prompt-engineering #benchmarks #llm-evaluation #methodology

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 19:13 UTC.
