Anthropic prompt caching: cache_control markers and break-even math explained
Ravi Patel's Dev.to post breaks down Anthropic's prompt caching mechanics, including the `cache_control` marker syntax, a two-tier write premium (1.25x for 5-minute TTL, 2x for 1-hour TTL), and a 0.10x read price that delivers up to a 90% discount on cached input tokens.
Score breakdown
The post's detailed break-even tables make concrete when each TTL tier actually reduces costs versus increases them, giving developers a practical framework for deciding which TTL to use based on their request frequency.
- 01Prompt caching requires explicit opt-in via `cache_control: { type: "ephemeral" }` on a content block; it does not fire automatically the way OpenAI's caching does.
- 02Cache write premium is 1.25x normal input price for the 5-minute TTL and 2x for the 1-hour TTL.
- 03Cache reads cost 0.10x normal input price — a 90% discount — for both TTL tiers.
Ravi Patel's post describes Anthropic's prompt caching as one of the highest-ROI LLM cost-reduction techniques available, while noting that its mechanics are not immediately obvious from the documentation. The system works by caching the model's internal attention state for a stable prompt prefix — not the response itself — so that the expensive prefix-attention computation is skipped on subsequent requests. Developers must explicitly tag the stable portion of a prompt with `cache_control: { type: "ephemeral" }` on a content block; the cache key is the byte-exact content of everything up to and including that marker, meaning any single-character change invalidates the cache.
A 5-minute TTL write costs 1.25x the normal input price, and a 1-hour TTL write costs 2x.
The pricing structure has two tiers. A 5-minute TTL write costs 1.25x the normal input price, and a 1-hour TTL write costs 2x. Both TTLs share the same 0.10x read price on cache hits. The post provides detailed break-even tables: the 5-minute TTL nets a saving at the second hit (average cost drops to 0.675x, a 32.5% saving), while the 1-hour TTL requires approximately three hits before it becomes cheaper than uncached (0.733x, a 27% saving). At steady state with a warm cache, both TTLs asymptotically approach the 0.10x read price — a 90% discount on the cached portion. Output tokens remain at full price throughout. The post also notes that multiple `cache_control` markers can be placed on nested content blocks to cache layered prefix levels, and that the cache key is sensitive to model parameters and tool definitions in addition to prompt text.
Key facts
- 01Prompt caching requires explicit opt-in via `cache_control: { type: "ephemeral" }` on a content block; it does not fire automatically the way OpenAI's caching does.
- 02Cache write premium is 1.25x normal input price for the 5-minute TTL and 2x for the 1-hour TTL.
- 03Cache reads cost 0.10x normal input price — a 90% discount — for both TTL tiers.
- 04The 5-minute TTL reaches break-even at the second cache hit (average cost 0.675x, a 32.5% saving).
- 05The 1-hour TTL needs roughly three hits to net out (0.733x average, a 27% saving) but survives 12x longer between requests than the 5-minute TTL.
- 06The cache stores the model's internal prefix-attention state, not the response; the model still generates output token-by-token.
- 07The cache key is byte-exact: any change to the prompt content, model parameter, or tool definition invalidates the cache.
Topics
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 14, 2026 · 09:08 UTC. How this works →