Jun 11, 2026 · 1 min read · Research Papers

Self-Inspect MCP surfaces 3.5x more agent assumptions, no correctness gain

An eval of the Self-Inspect MCP — which injects a single metathought mid-task — showed coding agents surfaced ~3.5x more decision forks (14/30 vs ~4/30 turns) without improving correctness on a well-specified task.

r/mcp·u/frank_brsrk


Composite score: 4.9 out of 10

Weights applied: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%.

Why it matters

The eval concretely separates two effects of the Self-Inspect MCP: it reliably increases the visibility of silent agent assumptions mid-task, but does not improve correctness when the task is already well-specified — clarifying where the tool does and does not add value.


Summary: our read of the original

u/frank_brsrk published eval results for Self-Inspect MCP, a tool that injects a single "metathought" mid-task to surface an agent's silent assumptions. The experiment pitted two Claude Sonnet 4.6 coding agents against each other, both building the same usage-billing module over a fixed 30-turn conversation with byte-identical base prompts. The only difference was that one agent called Self-Inspect once per turn. Turns were scored on whether the agent's reply surfaced a decision fork — an assumption, precondition, edge case, or risk raised explicitly rather than silently resolved.


The tool condition surfaced forks in 14/30 turns in both runs, compared to 3/30 and 5/30 for the baseline — approximately a 3.5x increase, described as consistent within each condition. One concrete example: at turn 9, the tool agent flagged a snapshot-averaging bug (averaging daily storage snapshots over a month breaks when there are fewer snapshots than days), which the baseline agent shipped past. Another example showed the metathought question arriving precisely as the agent was about to persist state, prompting it to surface the assumption that "persistence lives outside the module."
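The post does not show the billing code, so the following is only a minimal sketch of one plausible reading of the turn-9 snapshot-averaging issue. The function names and the sample data are hypothetical. It shows how dividing by calendar days and dividing by recorded snapshots diverge once a month has fewer snapshots than days, which is the kind of silent choice the tool agent raised.

```python
import calendar
from datetime import date


def average_over_calendar_days(snapshots: dict[date, float], year: int, month: int) -> float:
    """Sum the daily snapshots and divide by the number of days in the month.

    Missing days are implicitly treated as zero usage.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(snapshots.values()) / days_in_month


def average_over_recorded_snapshots(snapshots: dict[date, float]) -> float:
    """Sum the daily snapshots and divide by how many were actually recorded."""
    if not snapshots:
        raise ValueError("no snapshots recorded for this period")
    return sum(snapshots.values()) / len(snapshots)


# Hypothetical data: an account with only 10 snapshots in June, 100 GB each.
june = {date(2026, 6, day): 100.0 for day in range(21, 31)}

by_calendar_days = average_over_calendar_days(june, 2026, 6)  # 1000 / 30
by_recorded = average_over_recorded_snapshots(june)           # 1000 / 10

print(f"calendar-day average: {by_calendar_days:.2f} GB")
print(f"recorded-snapshot average: {by_recorded:.2f} GB")
```

Which divisor is right depends on billing policy (for example, whether days without a snapshot should count as zero). The sketch only illustrates that the choice changes the result by a factor of three here, so it is worth stating explicitly.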

Correctness, however, did not move. Both conditions caught the planted contradictions and produced correct designs. The post concludes that on a fully-specified task a capable model already self-checks, so the tool's effect is on process legibility — making assumptions visible — rather than on final output quality. The eval is deterministic (no LLM involved in scoring, open CSV), and the author verified calls were real by re-sending logged thoughts and receiving identical metathoughts. Raw data, scorer, methodology, and a one-command reproduction are published at the project's GitHub repository.
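The replay check described above can be sketched as follows. This is an assumption-laden illustration, not the author's script: the log format (pairs of thought and logged metathought), the `find_replay_mismatches` helper, and the `fake_tool` stand-in are all hypothetical, and a real check would pass a function that calls the actual Self-Inspect MCP server.

```python
from typing import Callable, Iterable


def find_replay_mismatches(
    log: Iterable[tuple[str, str]],
    call_tool: Callable[[str], str],
) -> list[int]:
    """Re-send each logged thought and return the indices where the
    replayed metathought differs from the one stored in the log."""
    mismatches = []
    for index, (thought, logged_metathought) in enumerate(log):
        if call_tool(thought) != logged_metathought:
            mismatches.append(index)
    return mismatches


def fake_tool(thought: str) -> str:
    """Hypothetical deterministic stand-in for the real tool call."""
    return f"You are currently: {thought.lower()}"


# Hypothetical log entries: (thought sent, metathought received).
log = [
    ("Persisting invoice state", "You are currently: persisting invoice state"),
    ("Averaging daily snapshots", "You are currently: averaging daily snapshots"),
]

print(find_replay_mismatches(log, fake_tool))
```

An empty result means every replayed call matched its logged response, which is the property the author reports checking.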

Key facts

01 Tool condition: 14/30 turns surfaced a decision fork in both runs; baseline: 3/30 and 5/30 turns, or approximately 3.5x more assumptions surfaced with the tool.
02 Agents used: Claude Sonnet 4.6, building a usage-billing module over a fixed 30-turn conversation with byte-identical base prompts.
03 At turn 9, the tool agent flagged a snapshot-averaging bug (averaging daily storage snapshots over a month breaks when there are fewer snapshots than days); the baseline missed it.
04 Correctness was unchanged: both conditions caught planted contradictions and shipped correct designs.
05 The eval is deterministic (no LLM in the scorer, open CSV) and was verified by re-sending logged thoughts to confirm identical metathoughts.
06 Self-Inspect MCP returns one metathought per call describing what the agent is currently doing; it is installable via `npx -y self-inspect-mcp`.
07 Raw data (4 conversation logs), scorer, methodology, and one-command reproduction are available on GitHub.
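For readers who want to check the headline figure, here is a short sketch of the arithmetic behind "approximately 3.5x", using only the per-run counts reported above. Pooling the two runs per condition is our assumption about how the ratio was derived; the variable and function names are illustrative.

```python
from fractions import Fraction

TURNS_PER_RUN = 30
tool_runs = [14, 14]      # turns with a surfaced fork, per run, tool condition
baseline_runs = [3, 5]    # same count for the baseline condition


def pooled_rate(counts: list[int], turns: int = TURNS_PER_RUN) -> Fraction:
    """Fraction of turns that surfaced a fork, pooled across runs."""
    return Fraction(sum(counts), len(counts) * turns)


tool_rate = pooled_rate(tool_runs)          # 28/60
baseline_rate = pooled_rate(baseline_runs)  # 8/60, i.e. about 4/30 per run
ratio = tool_rate / baseline_rate

print(f"tool: {float(tool_rate):.3f}, baseline: {float(baseline_rate):.3f}, ratio: {float(ratio):.1f}x")
```

With only two runs per condition and a baseline that ranges from 3 to 5, the ratio against an individual baseline run spans roughly 2.8x to 4.7x, so the 3.5x figure is best read as a point estimate.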

Topics

#mcp #agent-framework #tool-use #benchmarks #reasoning

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.
