Vibes-based testing ("it looked good in a demo") scales to about three pull requests. Past that you need evals: a repeatable way to measure whether the agent meets your quality bar on tasks that matter to you, and a way to notice when a model change regresses something.
Run the agent on a public benchmark and compare against the leaderboard. Useful as a general capability signal; weak as a signal for your own task.
| Benchmark | What it measures | Notes |
|---|---|---|
| HumanEval | Function-level Python code generation | Saturated — most frontier models ≥95%. Low signal now. |
| SWE-bench | Real-world GitHub issue → patch on open-source Python repos | The current benchmark most coding agents cite. |
| SWE-bench Verified | Human-audited subset of SWE-bench | Higher quality, lower variance. |
| TAU-bench | Multi-turn tool use (customer service simulation) | Tests agent loop + tool selection, not just code gen. |
| MBPP | Simple Python programs | Saturated like HumanEval. |
| BigCodeBench | Library usage across 139 Python libs | Harder than HumanEval; still useful. |
Use these to pick a model. Don't use them to ship.
The eval that matters: a fixed set of inputs from your domain with a programmatic grader.
Shape:
```
input → agent → output
                  ↓
               grader → pass/fail (or numeric score)
```

Graders:
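As a rough illustration of what a programmatic grader could look like, here is a minimal sketch. The `Grade` and `Grader` types and the three grader functions are hypothetical names invented for this example, and it assumes the agent output and the expected value are both plain strings.

```typescript
// Hypothetical grader shapes for illustration; not taken from any library.
type Grade = { passed: boolean; score: number; reason: string };
type Grader = (actual: string, expected: string) => Grade;

// Exact match after trimming whitespace: strict, cheap, deterministic.
const exactMatch: Grader = (actual, expected) => {
  const passed = actual.trim() === expected.trim();
  return { passed, score: passed ? 1 : 0, reason: passed ? 'exact match' : 'output differs' };
};

// Substring check: passes when the expected text appears anywhere in the output.
const contains: Grader = (actual, expected) => {
  const passed = actual.includes(expected);
  return { passed, score: passed ? 1 : 0, reason: passed ? 'found expected text' : 'expected text missing' };
};

// Numeric score: fraction of required keywords present, compared to a pass threshold.
function keywordCoverage(keywords: string[], threshold: number): Grader {
  return (actual) => {
    const lower = actual.toLowerCase();
    const hits = keywords.filter((k) => lower.includes(k.toLowerCase())).length;
    const score = keywords.length === 0 ? 1 : hits / keywords.length;
    return { passed: score >= threshold, score, reason: `${hits}/${keywords.length} keywords` };
  };
}
```

Returning a `reason` alongside the verdict is a design choice that makes failed fixtures easier to triage when reading a run file later.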
Sample real agent runs from production, score them (LLM-as-judge + human), turn the hard cases into new eval fixtures. This closes the loop.
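The sample, score, and promote loop described above might be sketched like this. All record shapes (`ProdRun`, `Review`, `ProdFixture`) and function names are assumptions made for illustration; the judge score and human verdict are taken as already recorded, since how you obtain them depends on your stack.

```typescript
// Hypothetical record shapes; adapt them to whatever your logging actually stores.
type ProdRun = { id: string; input: string; output: string };
type Review = { runId: string; judgeScore: number; humanVerdict: 'good' | 'bad' };
type ProdFixture = { id: string; input: string; expected: string };

// Take every k-th run: deterministic, so a sample can be reproduced later.
function sampleEvery(runs: ProdRun[], k: number): ProdRun[] {
  return runs.filter((_, i) => i % k === 0);
}

// Keep the hard cases: a human marked the output bad, or the judge score is below the threshold.
function hardCases(runs: ProdRun[], reviews: Review[], judgeThreshold: number): ProdRun[] {
  const byRun = new Map(reviews.map((r) => [r.runId, r] as const));
  return runs.filter((run) => {
    const review = byRun.get(run.id);
    return review !== undefined && (review.humanVerdict === 'bad' || review.judgeScore < judgeThreshold);
  });
}

// A bad output cannot serve as the expected value, so a reviewer supplies the correction.
function toFixture(run: ProdRun, correctedOutput: string): ProdFixture {
  return { id: `prod-${run.id}`, input: run.input, expected: correctedOutput };
}
```

Runs without a review are skipped rather than treated as failures, so unreviewed traffic never lands in the fixture set by accident.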
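```typescript
// Uses top-level await, so run it as an ES module.
import { mkdirSync, readFileSync, writeFileSync } from 'node:fs';
import { runAgent } from './your-agent';

// Placeholder grader (assumes string outputs): exact match after trimming.
// Swap in whatever check fits your domain.
const gradeOutput = (actual: string, expected: string): boolean =>
  actual.trim() === expected.trim();

const fixtures = JSON.parse(readFileSync('./evals/fixtures.json', 'utf8'));

// Run every fixture through the agent concurrently and grade each output.
const results = await Promise.all(
  fixtures.map(async (f) => {
    const actual = await runAgent(f.input);
    return { id: f.id, expected: f.expected, actual, passed: gradeOutput(actual, f.expected) };
  }),
);

const passedCount = results.filter((r) => r.passed).length;
const passRate = passedCount / results.length;
console.log(`Pass rate: ${(passRate * 100).toFixed(1)}% (${passedCount}/${results.length})`);

// Write full trace for post-hoc review
mkdirSync('./evals/runs', { recursive: true });
writeFileSync(`./evals/runs/${Date.now()}.json`, JSON.stringify(results, null, 2));
```

Nothing fancy. 20 fixtures plus a grader function is enough for a useful signal. Scale up as you grow.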
The most common need: "did the new model / new prompt regress something?"
Tools that wrap this: Langfuse, PromptLayer, Braintrust, Inspect AI. Or build it yourself — the harness above plus a Git-tracked results dir gets you 80% of the way.
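If you go the build-it-yourself route, answering "did something regress?" can be as simple as diffing two run files. The sketch below assumes each run is an array of per-fixture records with at least `id` and `passed`, matching what the harness writes; `diffRuns` and `passRate` are hypothetical helper names.

```typescript
// Only the fields needed for comparison; run files may carry more.
type RunResult = { id: string; passed: boolean };
type RunDiff = { regressed: string[]; fixed: string[]; passRateDelta: number };

function passRate(results: RunResult[]): number {
  return results.length === 0 ? 0 : results.filter((r) => r.passed).length / results.length;
}

// Compare a candidate run against a baseline, fixture by fixture.
function diffRuns(baseline: RunResult[], candidate: RunResult[]): RunDiff {
  const before = new Map(baseline.map((r) => [r.id, r.passed] as const));
  const regressed: string[] = [];
  const fixed: string[] = [];
  for (const r of candidate) {
    const was = before.get(r.id);
    if (was === undefined) continue; // fixture is new in this run; nothing to compare
    if (was && !r.passed) regressed.push(r.id);
    if (!was && r.passed) fixed.push(r.id);
  }
  return { regressed, fixed, passRateDelta: passRate(candidate) - passRate(baseline) };
}
```

Listing regressed fixture ids matters more than the aggregate delta: a change can fix two fixtures and break two others while leaving the pass rate unchanged. To use it, load two files from the results directory with `JSON.parse(readFileSync(path, 'utf8'))` and pass them in as baseline and candidate.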
| Trigger | What to run |
|---|---|
| Prompt edit | Full task-specific eval suite |
| Model upgrade (Sonnet 4.5 → 4.6) | Full suite + benchmark (HumanEval / SWE-bench subset) |
| New tool / MCP server added | Subset that exercises tool use |
| Weekly cron | Smoke test — 5 fixtures, catches drift |
| Before shipping to prod | Full suite + sampled production replay |
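One way to keep this table from drifting out of sync with CI is to encode it as data. The sketch below is an assumption-heavy illustration: the trigger names, the `tags` field on fixtures, and the `smoke` and `tool-use` tag values are all invented for the example.

```typescript
type Trigger = 'prompt-edit' | 'model-upgrade' | 'new-tool' | 'weekly-cron' | 'pre-release';
// `tags` is an assumed fixture field used to mark tool-use and smoke-test cases.
type TaggedFixture = { id: string; tags: string[] };
type EvalPlan = { fixtures: TaggedFixture[]; runBenchmark: boolean; replayProduction: boolean };

// Translate a trigger into the set of checks the table prescribes for it.
function planFor(trigger: Trigger, all: TaggedFixture[]): EvalPlan {
  const tagged = (tag: string) => all.filter((f) => f.tags.includes(tag));
  switch (trigger) {
    case 'prompt-edit':
      return { fixtures: all, runBenchmark: false, replayProduction: false };
    case 'model-upgrade':
      return { fixtures: all, runBenchmark: true, replayProduction: false };
    case 'new-tool':
      return { fixtures: tagged('tool-use'), runBenchmark: false, replayProduction: false };
    case 'weekly-cron':
      // Smoke test: at most 5 fixtures tagged for it.
      return { fixtures: tagged('smoke').slice(0, 5), runBenchmark: false, replayProduction: false };
    case 'pre-release':
      return { fixtures: all, runBenchmark: false, replayProduction: true };
  }
}
```

Because `Trigger` is a closed union, adding a new trigger without a matching `case` becomes a compile-time error under strict settings instead of a silently skipped eval.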