Prompt structure, not model choice, drives AI agent cost overruns
Serhii Panchyshyn explains how re-sending stable prompt content on every agent loop iteration causes costs to multiply by 5–7x, and how proper prompt ordering for prefix caching can cut input costs by roughly 80%.
Developers building multi-step agentic pipelines can cut LLM input costs by a large multiple — not just a percentage — by auditing prompt structure and ensuring stable content is left-anchored before any variable or loop-generated content.
- 01 Re-sending stable prompt content across 5–7 agent loop iterations multiplies costs by 5–7x, not just a percentage.
- 02 A medium-resolution image costs ~2,000 tokens; re-sent across 7 passes, that's ~14,000 tokens billed for content seen once.
- 03 Restructuring prompt layout cut the author's input costs by roughly 80% in about 30 minutes.
Serhii Panchyshyn identifies the agentic loop as a cost multiplier that most developers overlook. When an agent calls a model five to seven times per user interaction, every iteration re-bills the entire prompt — including static system instructions, tool definitions, and even the original user message. Panchyshyn describes a personal example where a medium-resolution image (~2,000 tokens) was re-sent across seven passes, billing roughly 14,000 tokens for content the model only needed once. A 30-minute fix cut his input costs by roughly 80%. He contrasts this with the more common optimization instinct — switching to a smaller model or trimming the system prompt — which yields linear, percentage-based savings, whereas fixing the loop structure yields multiplicative savings proportional to iteration count.
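The multiplier can be checked with a few lines of arithmetic. The following is a minimal sketch, not Panchyshyn's code: the function name `billed_input_tokens` is hypothetical, and the 10% cached-read rate is an assumption chosen for illustration. Real pricing varies by provider and may include cache-write surcharges that are ignored here.

```python
def billed_input_tokens(stable_tokens: int, iterations: int,
                        cached_rate: float = 1.0) -> float:
    """Effective input tokens billed for stable content re-sent on every pass.

    The first pass is billed in full; each later pass is billed at
    `cached_rate` (1.0 = no caching, 0.1 = cached reads cost 10% of normal).
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return stable_tokens + stable_tokens * cached_rate * (iterations - 1)


image_tokens = 2_000  # medium-resolution image, figure from the article
passes = 7            # agent loop iterations, figure from the article

uncached = billed_input_tokens(image_tokens, passes)
cached = billed_input_tokens(image_tokens, passes, cached_rate=0.1)  # assumed rate

print(f"uncached: {uncached:.0f} tokens")
print(f"cached:   {cached:.0f} token-equivalents")
print(f"saving:   {1 - cached / uncached:.0%}")
```

Without caching, the stable content scales linearly with the number of passes, which is why the saving grows with iteration count rather than staying a fixed percentage.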
The mechanism behind the fix is prefix caching, now offered by all major LLM providers. OpenAI offers up to a 90% discount on cached tokens; Anthropic's cache reads cost about 10% of normal input cost. Both are left-anchored prefix caches: the cache matches from byte zero forward, and any divergence ends the match entirely. This means prompt layout is not cosmetic — it is economic. The correct structure is `[STABLE PREFIX][cache breakpoint][GROWING TAIL]`, with everything static pushed as far left as possible. Panchyshyn uses LightRAG's entity extraction pipeline as a concrete anti-pattern: the framework embeds variable `{input_text}` directly inside the system prompt alongside ~1,300 tokens of static instructions, breaking the shared prefix on every chunk. For an 8,000-chunk indexing run, this results in roughly 11.6 million tokens billed as new; separating the variable input into the user message would allow ~10.4 million of those tokens to hit the cache — a 45% cost reduction from three lines of code.
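To make the left-anchored behavior concrete, here is a toy sketch. It is not LightRAG's actual template and not any provider's matching algorithm: whitespace-split words stand in for real tokens, and the names `shared_prefix_len` and `input_cost_reduction` are hypothetical. The second half reproduces the indexing arithmetic from the figures above. The cached-token discount is left as a parameter because the summary does not say which rate underlies the 45% figure; a 50% discount happens to yield roughly that number.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens two prompts share before they first diverge."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy "tokens": whitespace-split words stand in for real tokenizer output.
instructions = "Extract entities and relations from the text .".split()
chunk_a = "Alice founded Acme in 2001".split()
chunk_b = "Bob joined Initech last year".split()

# Anti-pattern (toy layout): variable input sits ahead of static instructions.
bad_a = ["Text:"] + chunk_a + instructions
bad_b = ["Text:"] + chunk_b + instructions

# Fix: static instructions first, variable input last.
good_a = instructions + ["Text:"] + chunk_a
good_b = instructions + ["Text:"] + chunk_b

print(shared_prefix_len(bad_a, bad_b))    # 1: only the "Text:" label matches
print(shared_prefix_len(good_a, good_b))  # 9: all instructions plus the label


def input_cost_reduction(total_tokens: int, cacheable_tokens: int,
                         cached_discount: float) -> float:
    """Fraction of input cost saved if cacheable tokens are billed at a discount."""
    return cacheable_tokens * cached_discount / total_tokens


chunks = 8_000
static_tokens = 1_300                # static instructions per chunk
cacheable = chunks * static_tokens   # 10,400,000 tokens that could hit the cache
total = 11_600_000                   # total billed tokens, figure from the article

reduction = input_cost_reduction(total, cacheable, cached_discount=0.5)  # assumed rate
print(f"cacheable tokens: {cacheable:,}")
print(f"input cost reduction at a 50% discount: {reduction:.0%}")
```

The calculation is an upper bound: it treats every chunk's static prefix as a cache hit, although the first request always pays full price to populate the cache.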
For multi-turn agentic loops specifically, Panchyshyn notes that the user's message should not sit at the end of the prompt. Because the user's question is fixed across all iterations while tool results and assistant responses accumulate, the user message belongs in the stable prefix. The recommended ordering, drawn from the Claude Code team at Anthropic, is: static system prompt and tool definitions first, then project-level context, then session context, with the growing conversation tail last. Each layer changes less often than the one that follows it, so a change in a later layer does not invalidate the cache hits on the layers before it.
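The layering can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the Claude Code team's implementation: `AgentPrompt`, `render`, and `reused_per_iteration` are hypothetical names, each prompt layer is treated as a single opaque segment, and cache reuse is approximated as the number of leading segments two consecutive requests share.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentPrompt:
    """Prompt segments ordered from most stable (left) to least stable (right)."""
    system_and_tools: str   # static across all users and sessions
    project_context: str    # stable within one project
    session_context: str    # stable within one session
    user_message: str       # fixed across the iterations of one interaction
    tail: List[str] = field(default_factory=list)  # grows on every iteration

    def render(self, user_last: bool = False) -> List[str]:
        layers = [self.system_and_tools, self.project_context, self.session_context]
        if user_last:
            # Anti-pattern: the fixed user message trails the growing tail.
            return layers + self.tail + [self.user_message]
        return layers + [self.user_message] + self.tail


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading segments two rendered prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reused_per_iteration(user_last: bool, iterations: int = 3) -> List[int]:
    """Segments reusable from the previous request on each loop iteration."""
    prompt = AgentPrompt("SYSTEM + TOOLS", "PROJECT", "SESSION", "USER QUESTION")
    previous = prompt.render(user_last)
    reused = []
    for step in range(1, iterations + 1):
        prompt.tail.append(f"tool result {step}")
        current = prompt.render(user_last)
        reused.append(shared_prefix_len(previous, current))
        previous = current
    return reused


user_in_prefix = reused_per_iteration(user_last=False)
user_at_end = reused_per_iteration(user_last=True)
print(user_in_prefix)  # [4, 5, 6]: the whole previous request is reused
print(user_at_end)     # [3, 4, 5]: the user message is re-billed every time
```

In this model, placing the user message in the prefix means each iteration pays full price only for the newly appended tail entry. With the message at the end, every iteration also re-bills the message itself, which is costly when it carries something like a 2,000-token image.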