DeepSeek V4 Flash matches Haiku 4.5 quality at a quarter of the cost
A Tessl benchmark of 19 model configurations finds DeepSeek V4 Flash scores 82.3 on agentic evals versus Haiku 4.5's 82.9, while costing $0.0236/task compared to Haiku's $0.10 — making it the highest points-per-dollar model tested.
The benchmark shows that skill augmentation and turn-count monitoring — not raw model capability or per-token pricing — are the primary levers controlling both quality and cost when running DeepSeek V4 Flash at scale.
Rob Willoughby and Simon Maple of Tessl benchmarked 19 model configurations on real agentic tasks, measuring total token counts and the prices providers actually charged. Their headline finding inverts the expected "cheap models are a trap" narrative: DeepSeek V4 Flash costs $0.0236/task on Fireworks pricing and scores 82.3, while Haiku 4.5 costs $0.10/task for a score of 82.9 and Sonnet 4.6 costs $0.296/task for 90.8. On a points-per-dollar basis, Flash delivers 3,482 pts/$ versus Haiku's 829 and Sonnet's 303. Within the DeepSeek V4 family, Pro costs $0.183/task for a score of 85.3, only 3 points above Flash at 7.7× the price. That gap scales to roughly $19,000/year extra at 10,000 tasks/month and $190,000/year at 100,000 tasks/month.
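The points-per-dollar and annual-gap figures above follow from simple arithmetic on the quoted per-task costs and scores. The sketch below is an illustration, not the authors' method: the function names are hypothetical, and because the inputs are the rounded figures from this summary, the results may differ from the published numbers by a percent or so.

```python
# Per-task cost (USD) and eval score, as quoted in the summary above.
MODELS = {
    "DeepSeek V4 Flash": (0.0236, 82.3),
    "DeepSeek V4 Pro": (0.183, 85.3),
    "Haiku 4.5": (0.10, 82.9),
    "Sonnet 4.6": (0.296, 90.8),
}


def points_per_dollar(cost_per_task: float, score: float) -> float:
    """Eval points bought per dollar of per-task spend."""
    return score / cost_per_task


def annual_extra_cost(cost_a: float, cost_b: float, tasks_per_month: int) -> float:
    """Extra yearly spend from running model A instead of model B."""
    return (cost_a - cost_b) * tasks_per_month * 12


if __name__ == "__main__":
    for name, (cost, score) in MODELS.items():
        print(f"{name}: {points_per_dollar(cost, score):,.0f} pts/$")

    pro_cost = MODELS["DeepSeek V4 Pro"][0]
    flash_cost = MODELS["DeepSeek V4 Flash"][0]
    for volume in (10_000, 100_000):
        gap = annual_extra_cost(pro_cost, flash_cost, volume)
        print(f"Pro vs. Flash at {volume:,} tasks/month: ${gap:,.0f}/year extra")
```

Keeping cost and score as separate inputs makes it easy to substitute your own measured per-task cost, which depends on provider pricing and on how many tokens your tasks actually consume.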
The article flags a critical blind spot in standard cost modeling: agent frameworks surface token counts by default, but turn counts are the variable that drives fat-tail cost explosions. Flash's mean is around 20 turns per task, but single worst-case runs in the dataset hit roughly 10× that figure. The authors recommend instrumenting agents for turns as well as tokens, knowing both the median and 95th-percentile turn counts, and setting timeout policies against the 95th percentile rather than the median.
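One way to act on that recommendation is to log the turn count of every run, summarize the distribution, and derive a hard turn cap from the 95th percentile. The sketch below is a minimal illustration under stated assumptions: the helper names, the 20% headroom factor, and the sample turn counts are all hypothetical and are not taken from the benchmark.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def turn_budget(turn_counts, headroom=1.2):
    """Summarize observed turns and derive a cap from p95 (headroom is an assumed factor)."""
    p95 = percentile(turn_counts, 95)
    return {
        "median": statistics.median(turn_counts),
        "p95": p95,
        "max_turns": math.ceil(p95 * headroom),
    }


class TurnLimitExceeded(RuntimeError):
    """Raised when an agent run uses up its turn budget without finishing."""


def run_with_turn_cap(step, max_turns):
    """Call step() once per turn until it returns True; return the number of turns used."""
    for turn in range(1, max_turns + 1):
        if step():
            return turn
    raise TurnLimitExceeded(f"no result after {max_turns} turns")


if __name__ == "__main__":
    # Made-up turn counts with one fat-tail outlier, for illustration only.
    observed = [12, 14, 15, 16, 17, 18, 18, 19, 19, 20,
                20, 21, 21, 22, 23, 24, 26, 30, 45, 200]
    print(turn_budget(observed))
```

A cap derived from the median would cut off a large share of ordinary runs, while a cap based on the 95th percentile lets almost all of them finish and still stops the rare runaway before it dominates the bill.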
A further caveat is that Flash's 82.3 score is skill-augmented — without a skill, Flash scores 64.1, meaning the skill contributes +18.2 points. That lift is conditional on the skill being precise, well-scoped, and relevant; a vague skill pulls performance back toward the 64.1 baseline. The article also notes that DeepSeek V4 Flash, GLM 5.1, and other non-GPT/Anthropic/Gemini models in the benchmark have publicly available weights, making self-hosting a viable path to near-zero marginal token cost for teams running at sufficient scale.
Key facts
- 01 DeepSeek V4 Flash costs $0.0236/task on Fireworks; Haiku 4.5 costs $0.10/task; Sonnet 4.6 costs $0.296/task
- 02 Flash scores 82.3 on the agentic eval (skill-augmented); Haiku 4.5 scores 82.9; Sonnet 4.6 scores 90.8
- 03 Flash delivers 3,482 eval points per dollar, versus 829 for Haiku 4.5 and 303 for Sonnet 4.6
- 04 DeepSeek V4 Pro costs $0.183/task and scores 85.3: a 7.7× price premium over Flash for only 3 extra points
- 05 The Pro vs. Flash cost gap compounds to $190,000/year extra at 100,000 tasks/month
- 06 Flash's mean is ~20 turns per task, but worst-case runs hit roughly 10× that, creating fat-tail budget risk
- 07 Without a skill, Flash scores 64.1; the skill adds +18.2 points to reach 82.3
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.