Code Mode cuts MCP tool-call token costs by 50–93%
Kuldeep Paul explains how Bifrost's Code Mode — where an agent writes an orchestration script instead of receiving full tool definitions in context — reduces MCP token costs by 50% to 93% depending on the number of connected servers and tools.
Teams running production AI agents with many MCP servers can cut token costs by over 50% — and up to 93% at scale — by switching to Code Mode without sacrificing task accuracy.
Kuldeep Paul's article identifies two root causes of runaway MCP token costs at production scale. First, context-window bloat: standard MCP clients load every tool definition — descriptions, parameter schemas, and return-type metadata — from every connected server into the prompt on every single request. An agent wired to hundreds of tools burns through tokens before the user's message is even processed. Second, intermediate result shuttling: in multi-hop workflows, data passes through the model's context between each tool call. The canonical example given is a Google Drive-to-Salesforce workflow where a meeting transcript flows through context twice, consuming roughly 50,000 extra tokens on a single run.
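To make the two cost drivers concrete, here is a back-of-envelope sketch in Python. The 150-token average per tool definition and the 25,000-token transcript size are illustrative assumptions, not figures from the article. Only the 508-tool count and the roughly 50,000-token shuttling overhead come from the text.

```python
def definition_overhead(num_tools: int, avg_tokens_per_definition: int) -> int:
    """Tokens spent on tool definitions alone, before the user's message is read."""
    return num_tools * avg_tokens_per_definition


def shuttle_overhead(payload_tokens: int, passes_through_context: int) -> int:
    """Extra tokens from an intermediate result passing through the model's context."""
    return payload_tokens * passes_through_context


if __name__ == "__main__":
    # ASSUMPTION: 150 tokens per definition is a made-up average for illustration.
    print(definition_overhead(508, 150))   # 76200 tokens on every request
    # ASSUMPTION: a 25,000-token transcript that crosses the context twice.
    print(shuttle_overhead(25_000, 2))     # 50000 extra tokens on one run
```

The first cost is paid on every request whether or not a tool is used. The second grows with payload size and with the number of hops in the workflow.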
Code Mode resolves both problems by inverting the execution pattern. Instead of receiving a full tool catalog, the model writes a short orchestration script (in Python or a sandboxed Starlark variant) and the gateway executes it, keeping intermediate results inside the sandbox. Bifrost's implementation exposes four meta-tools — `listToolFiles`, `readToolFile`, `getToolDocs`, and `executeToolCode` — so the model can discover and inspect only the interfaces it actually needs. Anthropic and Cloudflare independently validated the same core approach: Cloudflare compressed 2,500+ API endpoints to two tools and around 1,000 tokens of surface area, while Anthropic demonstrated a 98.7% token reduction on the Drive-to-Salesforce scenario, cutting consumption from 150,000 to 2,000 tokens.
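The inverted execution pattern can be sketched with a toy gateway. Everything here is hypothetical: the tool names, the `call` helper, and the function signatures are invented for illustration and are not Bifrost's actual API. The functions `list_tool_files` and `execute_tool_code` are only loosely modeled on the meta-tool names mentioned above. Python's `exec` with a trimmed namespace is also not a real sandbox; it stands in for one here.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for tools exposed by two connected MCP servers.
def get_document(doc_id: str) -> str:
    return "transcript line\n" * 5000  # a large intermediate payload


def update_record(record_id: str, notes: str) -> dict:
    return {"record_id": record_id, "chars_stored": len(notes)}


TOOLS: Dict[str, Callable] = {
    "gdrive.get_document": get_document,
    "salesforce.update_record": update_record,
}


def list_tool_files() -> List[str]:
    """Discovery step: the model sees tool names, not full schemas."""
    return sorted(TOOLS)


def execute_tool_code(script: str) -> dict:
    """Run a model-written script; only the value bound to `result` leaves the sandbox."""
    # ASSUMPTION: an empty __builtins__ is a toy restriction, not real isolation.
    namespace = {"__builtins__": {}, "call": lambda name, **kw: TOOLS[name](**kw)}
    exec(script, namespace)
    return namespace["result"]


# The kind of short orchestration script a model might write.
SCRIPT = '''
transcript = call("gdrive.get_document", doc_id="abc123")
result = call("salesforce.update_record", record_id="00Q-1", notes=transcript)
'''

if __name__ == "__main__":
    print(list_tool_files())
    print(execute_tool_code(SCRIPT))
```

The transcript is fetched and forwarded entirely inside `execute_tool_code`. The only thing that would return to the model's context is the small `result` dictionary.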
Bifrost's own benchmarks, run as 64 identical queries with Code Mode on versus off, show the savings growing sharply with scale: a 58% token drop at 96 tools across 6 servers, 84% at 251 tools across 11 servers, and 93% at 508 tools across 16 servers. All three configurations maintained a 100% task pass rate, so the cost reduction came with no measured loss of accuracy on this benchmark.
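The percentages quoted in this summary can be checked and compared with a few lines of Python. This sketch uses only the figures stated in the text. The helper names are made up for illustration.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage of tokens saved when usage falls from `before` to `after`."""
    return 100.0 * (before - after) / before


def baseline_multiplier(reduction_pct: int) -> float:
    """How many times more tokens the baseline uses than the reduced run."""
    return 100 / (100 - reduction_pct)


# (tools, servers, reported token reduction in percent), as quoted in the text.
BENCHMARKS = [(96, 6, 58), (251, 11, 84), (508, 16, 93)]

if __name__ == "__main__":
    # Anthropic's Drive-to-Salesforce figures: 150,000 tokens down to 2,000.
    print(round(percent_reduction(150_000, 2_000), 1))  # 98.7
    for tools, servers, pct in BENCHMARKS:
        print(
            f"{tools} tools / {servers} servers: {100 - pct}% of baseline tokens, "
            f"baseline uses {baseline_multiplier(pct):.2f}x as many"
        )
```

Seen as a multiplier, the gap between a 58% and a 93% reduction is larger than the percentages suggest: the baseline uses a bit over twice the tokens in the smallest configuration and roughly fourteen times in the largest.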
Key facts
- 01 Standard MCP clients load every tool definition from every connected server into the model's context on every request, regardless of whether those tools are used.
- 02 Intermediate results in multi-hop workflows (e.g., a meeting transcript passed between Google Drive and Salesforce) flow through the model's context multiple times, adding roughly 50,000 extra tokens per run in the cited example.
- 03 Anthropic demonstrated a 98.7% token reduction on the Drive-to-Salesforce scenario, cutting consumption from 150,000 to 2,000 tokens.
- 04 Cloudflare compressed 2,500+ API endpoints down to two tools and around 1,000 tokens of surface area using the same pattern.