LLM coding agents outperform raw-data models on time series, but still miss 22–34% of questions
A study by Rechtorík, Dušek, and Kasner finds that LLM coding agents using iterative Python queries can outperform models given raw numerical time series data by up to 10%, yet even the best agent still answers 22–34% of questions incorrectly.
Score breakdown
Despite code access giving LLM agents a measurable edge on time series tasks, a 22–34% error rate on benchmark questions exposes a concrete reliability gap that limits their use in high-stakes automated decision-making domains like finance and healthcare.
A paper by Filip Rechtorík, Ondřej Dušek, and Zdeněk Kasner investigates whether LLM agents can reliably analyze time series data — a data type that is ubiquitous in finance, healthcare, and environmental monitoring but difficult to process automatically. The researchers compare three approaches: feeding the agent raw numerical data, using the LLM as a coding agent that iteratively queries data via Python code, and a combination of both. Evaluation is conducted on two time series understanding benchmarks.
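As a rough illustration of how the first two setups differ, the sketch below contrasts serializing values into a prompt with running a model-written snippet against the data. The helper names `raw_prompt` and `run_agent_query`, the sample series, and the snippet are all hypothetical and are not taken from the paper.

```python
import statistics


def raw_prompt(series, question):
    """Raw-data setup: serialize every value into the prompt text."""
    values = ", ".join(f"{v:.2f}" for v in series)
    return f"Series: [{values}]\nQuestion: {question}"


def run_agent_query(code, series):
    """Coding-agent setup: run a model-written snippet against the data.

    The snippet is expected to assign its answer to `result`.
    Illustrative only: a real system would sandbox the execution.
    """
    namespace = {"series": list(series), "statistics": statistics}
    exec(code, namespace)
    return namespace["result"]


# Made-up example data, not from either benchmark.
series = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]

# Setup 1: the model sees the numbers directly.
prompt = raw_prompt(series, "Is the series trending upward?")

# Setup 2: the model writes code; this is one step of what could be
# an iterative loop of query, observe, and refine.
snippet = "result = statistics.mean(series[3:]) - statistics.mean(series[:3])"
diff = run_agent_query(snippet, series)

print(prompt)
print(diff)
```

The hybrid setup would, under the same assumptions, simply give the model both the serialized prompt and access to the query helper.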
The results show that coding agents outperform raw-data models by up to 10%, but the advantage is far from decisive: even the best-performing agent still answers approximately 22–34% of questions incorrectly. To understand where models succeed and fail, the authors use a strong LLM judge to analyze model outputs. The analysis reveals a mixed picture: coding agents can select appropriate statistical tests, yet they often miss important nuances in the data. Conversely, models that receive only raw numerical input can sometimes reach correct answers through informal back-of-the-envelope reasoning, which suggests the two paradigms have complementary strengths.
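To make the contrast between the two reasoning styles concrete, here is a minimal sketch that answers the same question ("is there an upward trend?") in both ways. The summary does not say which tests or heuristics the models actually used, so the half-versus-half comparison, the permutation test, the function names, and the sample data are all assumptions chosen for illustration.

```python
import random
import statistics


def envelope_trend(series):
    """Back-of-the-envelope check: second-half mean minus first-half mean."""
    half = len(series) // 2
    return statistics.mean(series[half:]) - statistics.mean(series[:half])


def slope(series):
    """Least-squares slope of the values against their time index."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = statistics.mean(series)
    num = sum((i - t_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - t_mean) ** 2 for i in range(n))
    return num / den


def permutation_trend_pvalue(series, n_perm=2000, seed=0):
    """One-sided permutation test for an upward trend.

    Shuffling destroys any time ordering, so the share of shuffles whose
    slope is at least the observed one estimates the p-value.
    """
    observed = slope(series)
    rng = random.Random(seed)
    shuffled = list(series)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if slope(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up example data, not from either benchmark.
series = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 21.0]

rough = envelope_trend(series)
p_value = permutation_trend_pvalue(series)

print(rough)
print(p_value)
```

On clean data like this the two approaches agree. They may diverge on series with outliers, seasonality, or level shifts, which is the kind of nuance the judge analysis reports coding agents missing.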
Key facts
- Three approaches are compared: raw numerical data input, LLM coding agent (iterative Python queries), and a hybrid of both.
- Coding agents outperform raw-data models by up to 10% on two time series understanding benchmarks.
- Even the best-performing agent answers 22–34% of questions incorrectly.
- A strong LLM judge is used to analyze model outputs and identify reasoning gaps.
- Coding agents can select appropriate statistical tests but often miss important nuances.
- Raw-data models can reach correct conclusions using back-of-the-envelope calculations.
- The study is motivated by LLM use in automated decision-making in finance, healthcare, and environmental monitoring.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.