Apr 21, 2026·2 min readAgentic Coding

Prompt structure, not model choice, drives AI agent cost overruns

Serhii Panchyshyn explains how re-sending stable prompt content on every agent loop iteration causes costs to multiply by 5–7x, and how proper prompt ordering for prefix caching can cut input costs by roughly 80%.

Dev.to #ai·Serhii Panchyshyn

Read at source

Composite

5.8

out of 10

Novelty · 25%

Novelty

Impact · 43%

Impact

Credibility · 12%

Credibility

Depth · 20%

Depth

Weights applied. How scores work ↗

Why it matters

Developers building multi-step agentic pipelines can cut LLM input costs by a large multiple — not just a percentage — by auditing prompt structure and ensuring stable content is left-anchored before any variable or loop-generated content.

01Re-sending stable prompt content across 5–7 agent loop iterations multiplies costs by 5–7x, not just a percentage.
02A medium-resolution image costs ~2,000 tokens; re-sent across 7 passes, that's ~14,000 tokens billed for content seen once.
03Restructuring prompt layout cut the author's input costs by roughly 80% in about 30 minutes.

Summary— our read of the original

Serhii Panchyshyn identifies the agentic loop as a cost multiplier that most developers overlook. When an agent calls a model five to seven times per user interaction, every iteration re-bills the entire prompt — including static system instructions, tool definitions, and even the original user message. Panchyshyn describes a personal example where a medium-resolution image (~2,000 tokens) was re-sent across seven passes, billing roughly 14,000 tokens for content the model only needed once. A 30-minute fix cut his input costs by roughly 80%. He contrasts this with the more common optimization instinct — switching to a smaller model or trimming the system prompt — which yields linear, percentage-based savings, whereas fixing the loop structure yields multiplicative savings proportional to iteration count.

The mechanism behind the fix is prefix caching, now offered by all major LLM providers.

The mechanism behind the fix is prefix caching, now offered by all major LLM providers. OpenAI offers up to a 90% discount on cached tokens; Anthropic's cache reads cost about 10% of normal input cost. Both are left-anchored prefix caches: the cache matches from byte zero forward, and any divergence ends the match entirely. This means prompt layout is not cosmetic — it is economic. The correct structure is `[STABLE PREFIX][cache breakpoint][GROWING TAIL]`, with everything static pushed as far left as possible. Panchyshyn uses LightRAG's entity extraction pipeline as a concrete anti-pattern: the framework embeds variable `{input_text}` directly inside the system prompt alongside ~1,300 tokens of static instructions, breaking the shared prefix on every chunk. For an 8,000-chunk indexing run, this results in roughly 11.6 million tokens billed as new; separating the variable input into the user message would allow ~10.4 million of those tokens to hit the cache — a 45% cost reduction from three lines of code.

For multi-turn agentic loops specifically, Panchyshyn notes that the user's message should not sit at the end of the prompt. Because the user's question is fixed across all iterations while tool results and assistant responses accumulate, the user message belongs in the stable prefix. The recommended ordering — drawn from the Claude Code team at Anthropic — is: static system prompt and tool definitions first, then project-level context, then session context, with the growing conversation tail last. Each layer is stable relative to the one below it, allowing cache hits to cascade across layers.

Key facts

01Re-sending stable prompt content across 5–7 agent loop iterations multiplies costs by 5–7x, not just a percentage.
02A medium-resolution image costs ~2,000 tokens; re-sent across 7 passes, that's ~14,000 tokens billed for content seen once.
03Restructuring prompt layout cut the author's input costs by roughly 80% in about 30 minutes.
04OpenAI offers up to a 90% discount on cached tokens; Anthropic's cache reads cost ~10% of normal input cost.
05Prefix caches are left-anchored: any byte divergence from the stored prefix ends the cache match immediately.
06LightRAG's entity extraction pipeline embeds variable `{input_text}` in the system prompt, breaking the shared prefix across all chunks — a 45% cost reduction is achievable by moving it to the user message.
07The Claude Code team's recommended prompt ordering: static system prompt + tool definitions → project context → session context → conversation tail (growing).

Topics

#prompt-engineering #cost-optimization #agent-framework #developer-tools #llm-efficiency

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 21, 2026 · 18:16 UTC. How this works →

Apr 21, 2026·2 min readAgentic Coding

Prompt structure, not model choice, drives AI agent cost overruns

Dev.to #ai·Serhii Panchyshyn

Read at source

Composite

5.8

out of 10

Novelty · 25%

Novelty

Impact · 43%

Impact

Credibility · 12%

Credibility

Depth · 20%

Depth

Weights applied. How scores work ↗

Why it matters

01Re-sending stable prompt content across 5–7 agent loop iterations multiplies costs by 5–7x, not just a percentage.
02A medium-resolution image costs ~2,000 tokens; re-sent across 7 passes, that's ~14,000 tokens billed for content seen once.
03Restructuring prompt layout cut the author's input costs by roughly 80% in about 30 minutes.

Summary— our read of the original

The mechanism behind the fix is prefix caching, now offered by all major LLM providers.

Key facts

01Re-sending stable prompt content across 5–7 agent loop iterations multiplies costs by 5–7x, not just a percentage.
02A medium-resolution image costs ~2,000 tokens; re-sent across 7 passes, that's ~14,000 tokens billed for content seen once.
03Restructuring prompt layout cut the author's input costs by roughly 80% in about 30 minutes.
04OpenAI offers up to a 90% discount on cached tokens; Anthropic's cache reads cost ~10% of normal input cost.
05Prefix caches are left-anchored: any byte divergence from the stored prefix ends the cache match immediately.
06LightRAG's entity extraction pipeline embeds variable `{input_text}` in the system prompt, breaking the shared prefix across all chunks — a 45% cost reduction is achievable by moving it to the user message.
07The Claude Code team's recommended prompt ordering: static system prompt + tool definitions → project context → session context → conversation tail (growing).

Topics

#prompt-engineering #cost-optimization #agent-framework #developer-tools #llm-efficiency

Methodology

Score breakdown

Key facts

Topics

Score breakdown

Key facts

Topics