Iris MCP server gives coding agents pass/fail verdicts on real app state
u/hack_the_developer built Iris, an MCP server that runs inside a live app and returns structured pass/fail verdicts with evidence to coding agents like Claude Code, instead of leaving them to interpret browser snapshots.
Iris replaces the agent's need to interpret a browser snapshot with a direct pass/fail verdict from inside the live app, addressing the failure mode where agents incorrectly self-report completion without confirming actual runtime behavior.
u/hack_the_developer describes the concrete failure that motivated the project: Claude Code shipped a payment endpoint that returned a 401 and reported the task as complete, and the bug went unnoticed for three days. Iris was the response: an MCP server designed to close that verification gap by running inside the real application rather than alongside it in a separate browser context.
The core mechanic is `iris_assert()`, which an agent calls with a set of conditions — for example, that a network request returns HTTP 200, that the console is clean, or that a specific signal has fired. Iris evaluates those conditions against the live app and returns a structured object: `{ pass: false, evidence: [...] }`, including what failed, the actual observed value, and the exact `file:line` location to address. This is explicitly contrasted with tools like Playwright MCP, which drive a separate browser and return a snapshot the agent must still interpret; the post argues Iris returns a verdict, not a snapshot.
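To make the verdict shape concrete, here is a minimal TypeScript sketch of how an agent-side helper might consume such a result. Only `pass` and `evidence` come from the post; the `Evidence` field names (`condition`, `expected`, `actual`, `location`), the `summarizeVerdict` helper, and the sample path `src/payments.ts:42` are hypothetical and chosen purely for illustration. This is not Iris's actual API.

```typescript
// Hypothetical shapes: the post only shows `{ pass, evidence: [...] }`.
// Every field inside Evidence is an assumption for illustration.
interface Evidence {
  condition: string; // which condition was checked, e.g. a network status
  expected: string;  // what the agent asserted
  actual: string;    // the actual observed value
  location: string;  // "file:line" pointing at the code to fix
}

interface Verdict {
  pass: boolean;
  evidence: Evidence[];
}

// Turn a verdict into a short report: "PASS", or "FAIL" plus one line per failure.
function summarizeVerdict(verdict: Verdict): string {
  if (verdict.pass) {
    return "PASS";
  }
  const lines = verdict.evidence.map(
    (e) => `${e.condition}: expected ${e.expected}, got ${e.actual} at ${e.location}`
  );
  return ["FAIL", ...lines].join("\n");
}

// Sample verdict modelled on the 401 incident; the path is made up.
const example: Verdict = {
  pass: false,
  evidence: [
    { condition: "net status", expected: "200", actual: "401", location: "src/payments.ts:42" },
  ],
};

console.log(summarizeVerdict(example));
```

The point of the sketch is that the agent branches on a boolean and reads a named location, rather than inferring success from a page snapshot.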
On token efficiency, the post reports a 73× reduction versus a full-tree snapshot on what it calls "the common loop" — approximately 100 tokens compared to 6,856 — while acknowledging that a full-tree-to-full-tree comparison narrows to only ~1.8×. Iris is released under the MIT license, scoped to dev-only and localhost-only use, and is available as `npm i -D @syrin/iris`.
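As a quick check on how the quoted figures fit together, the arithmetic can be sketched as follows. The variable names are illustrative; the only inputs are the numbers reported in the post. A 73× reduction from 6,856 tokens implies a verdict of roughly 94 tokens, which is consistent with "approximately 100"; dividing by a round 100 would instead give about 68.6×.

```typescript
const fullTreeTokens = 6856;  // full-tree snapshot size reported in the post
const claimedReduction = 73;  // reduction factor reported in the post

// Verdict size implied by the claimed factor.
const impliedVerdictTokens = fullTreeTokens / claimedReduction;

// Reduction factor if the verdict were exactly 100 tokens.
const ratioAtRoundHundred = fullTreeTokens / 100;

console.log(impliedVerdictTokens.toFixed(1)); // "93.9"
console.log(ratioAtRoundHundred.toFixed(1));  // "68.6"
```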
Key facts
- Claude Code shipped a 401 error on a payment endpoint and reported the task done; the bug went undetected for three days.
- Iris is an MCP server that runs inside the real running app and returns a structured pass/fail verdict with evidence to the agent.
- Agents call `iris_assert()` with conditions (e.g., net 200, console clean, signal fired); Iris returns `{ pass: false, evidence: [...] }` with the actual value and `file:line` to fix.
- Token benchmark: ~100 tokens vs. ~6,856 for a full-tree snapshot on the common loop, a claimed 73× reduction.
- Full-tree-to-full-tree comparison yields only ~1.8× token savings, a number the post explicitly does not hide.
- Iris is MIT-licensed, dev-only, and localhost-only; install via `npm i -D @syrin/iris`.
- The post distinguishes Iris from Playwright MCP, which drives a separate browser and returns a snapshot the agent must still interpret.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.