Cline's Ara Khan: use evals despite their flaws
Ara Khan of Cline argues that benchmark numbers aren't gospel and vibes aren't a system — the truth lies inconveniently in between, and a structured failure-analysis framework is the practical path forward.
Score breakdown
Treat eval score gains as a diagnostic signal rather than a leaderboard goal — Khan's three-zone failure-analysis framework gives AI/coding practitioners a concrete method for extracting actionable improvements from broken benchmarks without overfitting to them.
Ara Khan of Cline opens with a provocation: benchmark numbers are not gospel, but vibes are not a system either. The honest answer, Khan argues, is inconveniently in between, and the talk is structured around making that middle ground actionable.
Cline's own experience on Terminal Bench illustrates the point. The team started at 43%, and its score improvements came from three sources: container CPU and memory settings, raised timeouts, and prompt engineering techniques tailored to Anthropic model families. The prompt techniques did not transfer to Codex or Gemini. None of the gains came from picking a stronger model, which challenges the common assumption that model choice is the primary lever.
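As a rough illustration of the harness-side levers described above, the sketch below collects container CPU, memory, and timeout settings in one place and turns them into a `docker run` command line. It assumes a Docker-based harness, which the summary does not state. `HarnessConfig`, `docker_run_args`, the image name, and every numeric value are hypothetical and are not Cline's actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical knobs for one eval run; no value here comes from the talk."""
    cpus: float = 1.0      # CPU cores available to the task container
    memory: str = "2g"     # memory limit, written in Docker's size syntax
    timeout_s: int = 600   # wall-clock limit per task, enforced by the caller


def docker_run_args(cfg: HarnessConfig, image: str, command: list[str]) -> list[str]:
    """Build a `docker run` command line that applies the resource limits."""
    return [
        "docker", "run", "--rm",
        "--cpus", str(cfg.cpus),
        "--memory", cfg.memory,
        image, *command,
    ]


# Two illustrative configurations: a default run and a more generous one.
baseline = HarnessConfig()
tuned = HarnessConfig(cpus=4.0, memory="8g", timeout_s=1800)

print(docker_run_args(tuned, "task-image:latest", ["bash", "run_task.sh"]))
```

The timeout is kept outside the Docker arguments on purpose: a caller could pass the list to `subprocess.run(args, timeout=cfg.timeout_s)` so that the per-task limit is one more setting to vary between runs while the model stays fixed.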
Khan's proposed framework for working with evals involves a post-run failure analysis: send another agent through all the failure traces to identify which small levers actually move the score. The failures are sorted into three zones. Zone one covers obvious bugs. Zone two covers nuanced, harness-specific improvements that explain why a widely praised model can still underperform in a particular eval setup. Zone three is benchmark overfitting — a real and common trap that Khan explicitly tells practitioners to avoid.
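The three-zone sort described above could be sketched roughly as follows. This is a minimal illustration, not Khan's or Cline's implementation: `Zone`, `FailureTrace`, `triage`, and `keyword_reviewer` are hypothetical names, and the keyword rules are a toy stand-in for the second agent that would read each failure trace in practice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Zone(Enum):
    OBVIOUS_BUG = 1       # zone one: plain bugs
    HARNESS_SPECIFIC = 2  # zone two: nuanced, setup-specific improvements
    OVERFITTING = 3       # zone three: changes that would only fit the benchmark


@dataclass
class FailureTrace:
    task_id: str
    log: str


# Made-up marker strings; a real reviewer would be an agent reading the trace.
TOY_RULES = [
    ("traceback", Zone.OBVIOUS_BUG),
    ("timed out", Zone.HARNESS_SPECIFIC),
    ("hardcode", Zone.OVERFITTING),
]


def keyword_reviewer(trace: FailureTrace) -> Zone:
    """Toy stand-in for the reviewing agent."""
    text = trace.log.lower()
    for marker, zone in TOY_RULES:
        if marker in text:
            return zone
    # Refuse to guess rather than silently mislabel a trace.
    raise ValueError(f"no rule matches trace {trace.task_id}")


def triage(
    traces: Iterable[FailureTrace],
    reviewer: Callable[[FailureTrace], Zone],
) -> dict[Zone, list[str]]:
    """Group failed task ids by the zone the reviewer assigns."""
    zones: dict[Zone, list[str]] = {zone: [] for zone in Zone}
    for trace in traces:
        zones[reviewer(trace)].append(trace.task_id)
    return zones


traces = [
    FailureTrace("t1", "Traceback (most recent call last): ..."),
    FailureTrace("t2", "task timed out after 600s"),
    FailureTrace("t3", "would pass only if we hardcode the expected answer"),
]
result = triage(traces, keyword_reviewer)
for zone, task_ids in result.items():
    print(zone.name, task_ids)
```

Passing the reviewer in as a function keeps the grouping logic separate from whatever does the reading, so the toy rules can be swapped for an agent call. Following the advice in the talk, only the zone one and zone two groups would feed back into changes; the zone three group is recorded and left alone.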
Key facts
- 01 Cline started at 43% on Terminal Bench.
- 02 Score improvements came from container CPU and memory settings, raised timeouts, and prompt engineering, not from switching models.
- 03 Prompt engineering techniques were specific to Anthropic model families and did not transfer to Codex or Gemini.
- 04 Khan's framework uses a second agent to analyze all failure traces after an eval run.
- 05 Failures are categorized into three zones: obvious bugs, nuanced harness-specific issues, and benchmark overfitting.
- 06 Khan explicitly warns against overfitting to the benchmark (zone three).
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 7, 2026 · 12:45 UTC.