Jun 6, 2026 · 1 min read · Opinion & Analysis

Cline's Ara Khan: use evals despite their flaws

Ara Khan of Cline argues that benchmark numbers aren't gospel and vibes aren't a system — the truth lies inconveniently in between, and a structured failure-analysis framework is the practical path forward.

YouTube: AI Engineer

Composite score: 5.5 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Treat eval score gains as a diagnostic signal rather than a leaderboard goal — Khan's three-zone failure-analysis framework gives AI/coding practitioners a concrete method for extracting actionable improvements from broken benchmarks without overfitting to them.

Summary: our read of the original

Ara Khan of Cline opens with a provocation: benchmark numbers are not gospel, but vibes are not a system either. The honest answer, Khan argues, is inconveniently in between, and the talk is structured around making that middle ground actionable.

Cline's own experience on Terminal Bench illustrates the point. Starting at 43%, the team improved its score through container CPU and memory settings, raised timeouts, and prompt engineering. The prompt engineering techniques were tailored to Anthropic model families and did not transfer to Codex or Gemini. None of the gains came from picking a stronger model, which challenges the common assumption that model choice is the primary lever.

Khan's proposed framework for working with evals involves a post-run failure analysis: send another agent through all the failure traces to identify which small levers actually move the score. The failures are sorted into three zones. Zone one covers obvious bugs. Zone two covers nuanced, harness-specific improvements that explain why a widely praised model can still underperform in a particular eval setup. Zone three is benchmark overfitting — a real and common trap that Khan explicitly tells practitioners to avoid.
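The three-zone sort can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cline's implementation: the names `Zone`, `FailureTrace`, `triage`, `actionable`, and `keyword_reviewer` are hypothetical, and the keyword rules are a toy stand-in for the second agent that would read each failure trace.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Zone(Enum):
    OBVIOUS_BUG = 1       # zone one: obvious bugs
    HARNESS_SPECIFIC = 2  # zone two: nuanced, harness-specific improvements
    OVERFITTING = 3       # zone three: changes that would only fit the benchmark


@dataclass
class FailureTrace:
    task_id: str
    log: str


# Hypothetical reviewer type: in the talk's framing this role is played by a
# second agent; here it is any callable that assigns a zone to a trace.
Reviewer = Callable[[FailureTrace], Zone]


def triage(traces: Iterable[FailureTrace], review: Reviewer) -> dict[Zone, list[str]]:
    """Group failed task ids by the zone the reviewer assigns to each trace."""
    zones: dict[Zone, list[str]] = {zone: [] for zone in Zone}
    for trace in traces:
        zones[review(trace)].append(trace.task_id)
    return zones


def actionable(zones: dict[Zone, list[str]]) -> list[str]:
    """Keep zones one and two; zone three is deliberately left out."""
    return zones[Zone.OBVIOUS_BUG] + zones[Zone.HARNESS_SPECIFIC]


def keyword_reviewer(trace: FailureTrace) -> Zone:
    """Toy reviewer with made-up keyword rules, for illustration only."""
    text = trace.log.lower()
    if "traceback" in text:
        return Zone.OBVIOUS_BUG
    if "hardcode" in text:
        return Zone.OVERFITTING
    return Zone.HARNESS_SPECIFIC


# Invented example traces; none of these come from a real eval run.
demo_traces = [
    FailureTrace("task-a", "Traceback (most recent call last): KeyError"),
    FailureTrace("task-b", "command killed after timeout"),
    FailureTrace("task-c", "passes only if we hardcode the expected answer"),
]

demo_zones = triage(demo_traces, keyword_reviewer)
print(actionable(demo_zones))
```

Keeping the reviewer as a plain callable means the keyword rules can be swapped for a call to a reviewing agent without changing the sorting logic, and `actionable` makes the decision to set zone three aside explicit in code.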

Key facts

01 Cline started at 43% on Terminal Bench.
02 Score improvements came from container CPU and memory settings, raised timeouts, and prompt engineering, not from switching models.
03 Prompt engineering techniques were specific to Anthropic model families and did not transfer to Codex or Gemini.
04 Khan's framework uses a second agent to analyze all failure traces after an eval run.
05 Failures are categorized into three zones: obvious bugs, nuanced harness-specific issues, and benchmark overfitting.
06 Khan explicitly warns against overfitting to the benchmark (Zone three).

Topics

#benchmarks #evals #agent-framework #prompt-engineering

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 7, 2026 · 12:45 UTC.
