Gemini 3.1 Pro beats 3.5 Flash on cost despite higher token price
An analysis of ~3,300 paired agent runs across four Gemini models found that Gemini 3.1 Pro costs $0.66 per task versus $1.05 for Gemini 3.5 Flash, despite nearly identical scores of 87.9 and 88.6 respectively — because token volume, not list price, drives real-world agent costs.
For agentic workloads, the analysis shows that a model's per-token list price is a misleading cost signal — turn count and token volume at runtime determine the actual bill, making session-log auditing the only reliable way to compare model costs.
Rob Willoughby and Baptiste Fernandez at Tessl benchmarked four Gemini models — `3.1 Flash Lite`, `3 Flash Preview`, `3.1 Pro Preview`, and `3.5 Flash` — by running every task twice (once with a relevant skill applied, once without) across roughly 800 tasks per model in OpenHands, for approximately 3,300 paired runs total. Rather than relying on dashboard estimates, they extracted per-call token counts from agent session logs and applied Google's published per-token prices to compute actual per-task costs.
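The costing method described above can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: the `CallUsage` record, its field names, and every rate in `example_price` are hypothetical placeholders. Only the shape of the calculation (per-call token counts multiplied by published per-token prices, summed per task) comes from the article, so substitute the token fields your agent logs actually expose and the rates from the provider's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallUsage:
    """Token counts for one model call (hypothetical log schema)."""
    fresh_input: int   # input tokens billed at the full input rate
    cached_input: int  # input tokens served from the prompt cache
    output: int        # generated tokens


@dataclass(frozen=True)
class Price:
    """USD per million tokens. The values used below are placeholders."""
    input: float
    cached_input: float
    output: float


def task_cost(calls: list[CallUsage], price: Price) -> float:
    """Sum the cost in USD of every call made while solving one task."""
    total = 0.0
    for c in calls:
        total += c.fresh_input * price.input
        total += c.cached_input * price.cached_input
        total += c.output * price.output
    return total / 1_000_000


# Placeholder rates and a made-up three-turn session, for illustration only.
example_price = Price(input=2.00, cached_input=0.20, output=12.00)
example_calls = [
    CallUsage(fresh_input=8_000, cached_input=0, output=500),
    CallUsage(fresh_input=1_000, cached_input=8_500, output=400),
    CallUsage(fresh_input=1_200, cached_input=9_900, output=300),
]
print(f"${task_cost(example_calls, example_price):.4f} per task")
```

Keeping fresh and cached input as separate fields matters because the two are billed at different rates, which is what lets a long session cost more than the list price alone would suggest.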
The headline finding is that cost order and model name order are uncorrelated. Gemini 3.5 Flash carries a list price of $1.50/M tokens but costs $1.05 per task because it consumed 1.41 million input tokens across 39 agent turns. Gemini 3.1 Pro lists at $2.00/M tokens yet costs only $0.66 per task, using roughly half the token volume over 26 turns. The two models scored 88.6 and 87.9 respectively, a difference the authors describe as effectively identical. Gemini 3.5 Flash and Gemini 3.1 Flash Lite, in the same product family, differ dramatically in actual spend. The analysis also notes that 63–75% of input across runs was cache-read, meaning cost accumulates with turn count through the cache-read multiplier rather than the headline list price.
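To see why turn count can outweigh list price, consider a toy model of an agent loop in which each turn resends the whole conversation so far. This is a sketch under stated assumptions, not the article's measurement: the initial context size, the tokens added per turn, and the premise that all previously sent context is a cache hit are invented for illustration. Only the turn counts (26 and 39) come from the text.

```python
def simulate_input_tokens(turns: int, base_context: int, added_per_turn: int):
    """Return (fresh, cached) input tokens summed over one agent session.

    Assumes each turn resends the full history, and that everything sent
    on an earlier turn is read from cache (an idealised assumption).
    """
    fresh = cached = 0
    context = 0  # tokens already sent on previous turns
    for turn in range(turns):
        new = base_context if turn == 0 else added_per_turn
        cached += context  # the history is re-read from cache
        fresh += new       # only the new material is billed at the full rate
        context += new
    return fresh, cached


for turns in (26, 39):  # turn counts reported for 3.1 Pro and 3.5 Flash
    fresh, cached = simulate_input_tokens(
        turns, base_context=10_000, added_per_turn=1_500
    )
    share = cached / (fresh + cached)
    print(f"{turns} turns: {fresh + cached:,} input tokens, {share:.0%} cache-read")
```

In this toy model the fresh tokens grow linearly with turn count while the cache-read tokens grow roughly quadratically, so a session with 50% more turns re-reads far more than 50% more context. The idealised cache-hit assumption also pushes the cache-read share well above the 63–75% the analysis reports for real runs.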
The article also examines how adding a skill affects cost differently by model capability. For Gemini 3.1 Pro, a relevant skill reduced per-task cost by $0.20 (–23%) while adding 20 score points, because the model could follow structured guidance directly and reduce exploratory turns. For Gemini 3.5 Flash, cost shifted by less than $0.03 in either direction — the skill added context to process rather than a shortcut to apply. The authors frame this as: a skill compresses the solution path for a capable model and becomes overhead for a weaker one. The two identified operating points are Gemini 3.1 Pro with a skill at $0.66/task for top-tier performance, and Gemini 3 Flash Preview at $0.135/task for workloads where a score around 85 is acceptable.
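The skill figures for Gemini 3.1 Pro can be cross-checked with a few lines of arithmetic. The sketch below assumes, as the paragraph above states, that $0.66/task is the with-skill cost and $0.20 is the absolute saving; the without-skill baseline is derived here and is not quoted directly in the summary.

```python
cost_with_skill = 0.66  # USD per task, Gemini 3.1 Pro with a skill (from the summary)
saving = 0.20           # USD per task saved by adding the skill (from the summary)

# Derived baseline: what the same task would cost without the skill.
cost_without_skill = cost_with_skill + saving
relative_change = -saving / cost_without_skill

print(f"without skill: ${cost_without_skill:.2f}/task")
print(f"change from adding the skill: {relative_change:.1%}")
```

The derived change of about -23% agrees with the percentage quoted above, which suggests the reduction is measured against the without-skill baseline.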
Key facts
- ~3,300 paired skill-eval runs were conducted across four Gemini models in OpenHands, ~800 tasks per model.
- Gemini 3.5 Flash costs $1.05/task; Gemini 3.1 Pro costs $0.66/task, despite Pro having a higher list price of $2.00/M tokens vs. Flash's $1.50/M.
- Scores were nearly identical: Gemini 3.5 Flash scored 88.6, Gemini 3.1 Pro scored 87.9.
- Gemini 3.5 Flash consumed 1.41M input tokens across 39 agent turns per task; Pro used roughly half that volume over 26 turns.
- 63–75% of input tokens across runs were cache-read, making turn count the dominant cost driver.
- Adding a skill cut Gemini 3.1 Pro's per-task cost by $0.20 (–23%) and added 20 score points; Gemini 3.5 Flash's cost barely moved (<$0.03).
- Gemini 3 Flash Preview scores 85.4 at $0.135/task, roughly one-fifth the cost of Gemini 3.1 Pro and about one-eighth that of Gemini 3.5 Flash.
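One way to turn these figures into a model choice is to pick the cheapest model that clears a required score. The sketch below is illustrative only: the score floors are hypothetical quality bars, and the fourth model (Gemini 3.1 Flash Lite) is left out because this summary gives no score or cost for it.

```python
# (model, score, USD per task) as reported in the key facts above.
RESULTS = [
    ("Gemini 3.5 Flash", 88.6, 1.05),
    ("Gemini 3.1 Pro", 87.9, 0.66),
    ("Gemini 3 Flash Preview", 85.4, 0.135),
]


def cheapest_meeting(score_floor: float, results=RESULTS):
    """Return the lowest-cost (model, score, cost) with score >= score_floor, or None."""
    eligible = [r for r in results if r[1] >= score_floor]
    return min(eligible, key=lambda r: r[2]) if eligible else None


for floor in (85.0, 87.5):  # hypothetical quality bars
    print(floor, "->", cheapest_meeting(floor))
```

With these inputs the two floors select Gemini 3 Flash Preview and Gemini 3.1 Pro, the same two operating points the article identifies.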
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 13, 2026 · 08:58 UTC.