Evaluation

Last verified: 2026-04-17 · next review in 118 days

Vibes-based testing ("it looked good in a demo") scales to about three pull requests. Past that you need evals: a repeatable way to measure whether the agent meets your quality bar on tasks that matter to you, and a way to notice when a model change regresses something.

Three classes of eval

1. Off-the-shelf benchmarks

Run the agent on a public benchmark and compare to the leaderboard. Useful as a general capability signal; weak as a signal for your own tasks.

Benchmark	What it measures	Notes
HumanEval	Function-level Python code generation	Saturated — most frontier models ≥95%. Low signal now.
SWE-bench	Real-world GitHub issue → patch on open-source Python repos	The current benchmark most coding agents cite.
SWE-bench Verified	Human-audited subset of SWE-bench	Higher quality, lower variance.
TAU-bench	Multi-turn tool use (customer service simulation)	Tests agent loop + tool selection, not just code gen.
MBPP	Simple Python programs	Saturated like HumanEval.
BigCodeBench	Library usage across 139 Python libs	Harder than HumanEval; still useful.

Use these to pick a model. Don't use them to ship.

2. Task-specific evals (your own)

The eval that matters: a fixed set of inputs from your domain with a programmatic grader.

Shape:

input → agent → output
                  ↓
               grader → pass/fail  (or numeric score)

Graders:

Exact match — output must equal expected. Useful for structured-output tasks (classification, extraction).
Unit test pass — for code tasks: the agent's diff must make a failing test pass without breaking others.
LLM-as-judge — a second LLM scores the output on a rubric. Useful when "correct" is subjective. Noisier than other graders, needs calibration.
Human review — gold standard, slow. Use for the hardest subset.
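
As an illustration of the exact-match grader described above, here is a minimal sketch. The names `Grader`, `canonicalize`, `exactMatch`, and `labelMatch` are hypothetical, and scoring as 1 or 0 is an assumption; adapt it to whatever your harness expects.

```typescript
// A grader takes the agent's output and the expected value and returns a score in [0, 1].
// (Hypothetical type for this sketch; not part of any library.)
type Grader<T> = (actual: T, expected: T) => number;

// Serialize a JSON-like value with object keys sorted, so that key order
// does not affect equality. Array order is preserved.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return String(JSON.stringify(value));
}

// Exact match on structured output (e.g. an extraction result).
const exactMatch: Grader<unknown> = (actual, expected) =>
  canonicalize(actual) === canonicalize(expected) ? 1 : 0;

// Exact match on a classification label, ignoring case and surrounding whitespace.
const labelMatch: Grader<string> = (actual, expected) =>
  actual.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;
```

Canonicalizing before comparing is a design choice: it keeps the grader strict about values while tolerating key reordering, which agents often introduce in JSON output.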

3. Production traces

Sample real agent runs from production, score them (LLM-as-judge + human), turn the hard cases into new eval fixtures. This closes the loop.
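
One way to turn scored traces into fixtures is sketched below. The `ScoredTrace` and `Fixture` shapes, the `prod-` ID prefix, and the 0.5 judge threshold are all assumptions for illustration. "Hard" here is taken to mean either that a human reviewer failed the run or that the judge and the human disagreed.

```typescript
// Hypothetical shape of a production trace after scoring.
interface ScoredTrace {
  traceId: string;
  input: string;
  output: string;
  judgeScore: number;       // 0..1 from the LLM-as-judge
  humanPass?: boolean;      // set only for the human-reviewed sample
  correctedOutput?: string; // reviewer's reference answer, if one was written
}

interface Fixture {
  id: string;
  input: string;
  expected: string;
}

// Keep only human-reviewed traces that are "hard", and only when a
// trustworthy expected value exists for them.
function toFixtures(traces: ScoredTrace[], judgeThreshold = 0.5): Fixture[] {
  const fixtures: Fixture[] = [];
  for (const t of traces) {
    if (t.humanPass === undefined) continue; // never reviewed by a human
    const judgePass = t.judgeScore >= judgeThreshold;
    const isHard = !t.humanPass || judgePass !== t.humanPass;
    if (!isHard) continue;
    // Prefer the reviewer's correction; fall back to the agent's own output
    // only when a human confirmed that output was acceptable.
    const expected = t.correctedOutput ?? (t.humanPass ? t.output : undefined);
    if (expected === undefined) continue;
    fixtures.push({ id: `prod-${t.traceId}`, input: t.input, expected });
  }
  return fixtures;
}
```

Traces the judge failed but a human passed are kept on purpose: they double as calibration cases for the judge.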

A minimal eval harness

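import { mkdirSync, readFileSync, writeFileSync } from 'node:fs';
import { runAgent } from './your-agent';

// Placeholder grader: strict equality on the serialized output. Swap in your own.
function gradeOutput(actual: unknown, expected: unknown): boolean {
  return JSON.stringify(actual) === JSON.stringify(expected);
}

// Each fixture is expected to look like { id, input, expected }.
const fixtures = JSON.parse(readFileSync('./evals/fixtures.json', 'utf8'));

// Run all fixtures concurrently and grade each output as it arrives.
// For large suites, cap concurrency to stay under provider rate limits.
const results = await Promise.all(
  fixtures.map(async (f) => {
    const actual = await runAgent(f.input);
    return { id: f.id, expected: f.expected, actual, passed: gradeOutput(actual, f.expected) };
  }),
);

const passedCount = results.filter((r) => r.passed).length;
const passRate = results.length > 0 ? passedCount / results.length : 0;
console.log(`Pass rate: ${(passRate * 100).toFixed(1)}% (${passedCount}/${results.length})`);

// Write full trace for post-hoc review
mkdirSync('./evals/runs', { recursive: true });
writeFileSync(`./evals/runs/${Date.now()}.json`, JSON.stringify(results, null, 2));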

Nothing fancy. 20 fixtures + a grader function is enough for a useful signal. Scale up as you grow.

Regression detection

The most common need: "did the new model / new prompt regress something?"

Versioned eval runs. Every run is tagged with model ID, prompt version, fixture version. Store results in a DB or JSONL.
Paired comparison. For each fixture, compare old-run vs new-run. Anything that flipped pass→fail is a regression; fail→pass is a gain; fail→different-fail is drift.
Flaky-fixture detection. Run the same fixture N times; if pass rate is between 40% and 80%, the fixture is flaky — fix or remove.
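
The paired comparison above can be sketched as follows. The `EvalRun` and `FixtureResult` shapes are assumptions that mirror the tags listed in the first item, and treating "different fail" as "both failed with a different output" is one possible reading.

```typescript
interface FixtureResult {
  id: string;
  passed: boolean;
  actual: string;
}

// A versioned run: tagged with everything that could explain a change in results.
interface EvalRun {
  modelId: string;
  promptVersion: string;
  fixtureVersion: string;
  results: FixtureResult[];
}

type Flip = "regression" | "gain" | "drift" | "unchanged";

function classify(oldR: FixtureResult, newR: FixtureResult): Flip {
  if (oldR.passed && !newR.passed) return "regression"; // pass -> fail
  if (!oldR.passed && newR.passed) return "gain";       // fail -> pass
  if (!oldR.passed && !newR.passed && oldR.actual !== newR.actual) return "drift"; // fail -> different fail
  return "unchanged";
}

// Compare two runs fixture by fixture. Fixtures present in only one run are skipped.
function compareRuns(oldRun: EvalRun, newRun: EvalRun): Map<string, Flip> {
  if (oldRun.fixtureVersion !== newRun.fixtureVersion) {
    throw new Error("fixture versions differ; results are not comparable");
  }
  const oldById = new Map(oldRun.results.map((r) => [r.id, r] as const));
  const flips = new Map<string, Flip>();
  for (const r of newRun.results) {
    const prev = oldById.get(r.id);
    if (prev !== undefined) flips.set(r.id, classify(prev, r));
  }
  return flips;
}
```

Refusing to compare runs across fixture versions is deliberate: a changed fixture can flip a result without any change in the agent.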

Tools that wrap this: Langfuse, PromptLayer, Braintrust, Inspect AI. Or build it yourself — the harness above plus a Git-tracked results dir gets you 80% of the way.

When to run evals

Trigger	What to run
Prompt edit	Full task-specific eval suite
Model upgrade (Sonnet 4.5 → 4.6)	Full suite + benchmark (SWE-bench subset)
New tool / MCP server added	Subset that exercises tool use
Weekly cron	Smoke test — 5 fixtures, catches drift
Before shipping to prod	Full suite + sampled production replay

Anti-patterns

Writing the eval after you've shipped. The eval exists to catch regressions; if you write it after shipping, it's a post-hoc rationalization.
Over-relying on benchmarks. A model can ace SWE-bench and fail your customers' tasks.
LLM-as-judge without calibration. Different judges give different scores for the same output. Always ground against human judgments on a subset.
No flakiness handling. If your eval is stochastic, you need N-run averages, not single-shot pass/fail.
Evals that don't block deploys. An eval that's never enforced is just a dashboard.
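
A rough sketch of how the last two points might combine: average each fixture over N runs, then gate the deploy on the mean. The function names, the `GateResult` shape, and the threshold parameter are hypothetical; the 40% to 80% flaky band is the one from the regression section.

```typescript
// Pass rate for one fixture over N repeated runs.
function fixturePassRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) throw new Error("need at least one run per fixture");
  return outcomes.filter(Boolean).length / outcomes.length;
}

// A fixture counts as flaky when its pass rate falls inside the band (inclusive).
function isFlaky(rate: number, low = 0.4, high = 0.8): boolean {
  return rate >= low && rate <= high;
}

interface GateResult {
  ok: boolean;     // true when the mean pass rate meets the threshold
  mean: number;    // mean of per-fixture pass rates
  flaky: string[]; // fixture IDs to fix or remove
}

// runsByFixture maps fixture ID -> pass/fail outcome of each repeated run.
function gate(runsByFixture: Record<string, boolean[]>, minMeanPassRate: number): GateResult {
  const ids = Object.keys(runsByFixture);
  if (ids.length === 0) throw new Error("no fixtures to evaluate");
  const flaky: string[] = [];
  let total = 0;
  for (const id of ids) {
    const rate = fixturePassRate(runsByFixture[id]);
    if (isFlaky(rate)) flaky.push(id);
    total += rate;
  }
  const mean = total / ids.length;
  return { ok: mean >= minMeanPassRate, mean, flaky };
}
```

In a CI job under Node.js, setting `process.exitCode` to 1 when `ok` is false is one way to make the eval block the deploy.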
