LakeQA benchmark tests LLM agents on million-scale data lake search and reasoning
LakeQA is a new benchmark requiring LLM agents to both search a ~9.5 TB heterogeneous data lake and perform multi-hop reasoning to answer questions, with GPT-5.2 achieving only 18.37% exact-match accuracy.
Score breakdown
LakeQA exposes a significant performance gap in frontier LLMs — including GPT-5.2 at 18.37% exact-match — on tasks that require jointly searching a massive heterogeneous data lake and performing multi-hop reasoning, a combination absent from prior comprehensive benchmarks.
Haonan Wang, Jiaxiang Liu, and Yurong Liu present LakeQA, a benchmark targeting a gap in existing QA evaluation: most benchmarks supply evidence explicitly or make retrieval trivial, whereas real-world questions require searching through massive, heterogeneous data collections before any reasoning can begin. LakeQA addresses this by building tasks over roughly 9.5 TB of text drawn from Wikipedia and open-source government data, spanning both structured and unstructured formats. Every sample is annotated by at least one Ph.D.-level expert to ensure quality, and each task demands long-horizon multi-hop reasoning with implicit intermediate steps — agents must independently locate the correct documents and then synthesize evidence across multiple sources to produce a final answer.
Experiments on seven frontier LLMs reveal that LakeQA is substantially harder than existing benchmarks. GPT-5.2, one of the models tested, achieves an exact-match score of only 18.37%, illustrating that current models struggle when search and reasoning must be tightly coupled at scale. The authors position LakeQA as a realistic testbed for developing LLM agents capable of operating over modern data lakes, where evidence is neither pre-retrieved nor guaranteed to be present in a single document.
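For readers unfamiliar with the metric, exact-match accuracy can be sketched as below. This is a minimal illustration, not LakeQA's scoring code: the summary does not say how answers are normalized, so the SQuAD-style normalization (lowercasing, dropping punctuation and English articles) and the function names `normalize`, `exact_match`, and `exact_match_accuracy` are assumptions made for the example.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the benchmark may use other rules.
    """
    text = answer.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def exact_match_accuracy(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

Under this metric a prediction earns credit only when the whole normalized string matches, so an answer that is partially correct or phrased differently counts as a miss. That strictness is one reason exact-match scores on multi-hop tasks tend to be low.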
Key facts
- LakeQA is a new benchmark for search-centric QA over large-scale, heterogeneous data lakes.
- The data lake contains approximately 9.5 TB of text from Wikipedia and open-source government data, covering both structured and unstructured formats.
- Each benchmark sample is annotated by at least one Ph.D.-level expert.
- Tasks require long-horizon multi-hop reasoning with implicit intermediate steps across multiple documents.
- Seven frontier LLMs were evaluated on LakeQA.
- GPT-5.2 achieves an exact-match score of only 18.37% on LakeQA.
- The benchmark is designed to jointly evaluate both searching and reasoning capabilities of LLM agents.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 10, 2026 · 15:34 UTC.