Engineering

Prompt Caching: How We Cut Claude API Costs by 90% in Production

2026-05-17 Updated 2026-05-27 8 min read

Prompt caching is the single highest-leverage optimization for any production system running on Claude — done right, it cuts input token costs by ~90% and trims latency by up to 80% on cached prefixes. We use it on every agentic system we ship, and on long-running workloads it's the difference between a margin-positive product and one that quietly burns the runway.

This post is the version of prompt caching we wish existed when we started: what it actually is at the protocol level, where the sharp edges are, the four mistakes that silently kill your cache hit rate, and the architectural patterns we use across our AI integration services work to make caching the default rather than an afterthought.

What prompt caching actually is (at the protocol level)

Prompt caching lets you mark a stable prefix of your prompt — system instructions, tool definitions, RAG context, conversation history — so Anthropic stores the model's internal representation of those tokens server-side. On subsequent requests with the same prefix, the model skips re-processing it. You pay roughly 10% of the normal input price for cached reads, and the prefix loads almost instantly.

The mechanism is a single field, cache_control: {"type": "ephemeral"}, attached to a content block. Up to 4 cache breakpoints per request. Cache writes cost 1.25x normal input tokens, cache reads cost 0.1x. The default TTL is 5 minutes, refreshed on every hit; a 1-hour TTL is available at 2x the write surcharge.

Prompt caching operates at the model-prefix level — it's distinct from caching the assembled output of a chain of model calls, which is a different problem we cover in how to cache an expensive multi-step LLM pipeline with a two-layer semantic caching pattern. The reason this matters: every modern agentic workload reuses 90%+ of its prompt across turns. System prompt, tool schemas, knowledge base context, prior conversation — all stable. If you're not caching that, you're paying 10x what you should be on every turn after the first.

Why the 90% number is real, but conditional

The "90% cost reduction" headline is technically accurate but practically misleading. It's 90% off the cached portion of the prompt, not 90% off your total bill. If your cached prefix is 20k tokens and your fresh content per turn is 18k tokens, you'll see a much smaller real-world saving.

On our agentic builds the cached prefix is usually 80–95% of the total prompt — tool definitions alone often run 5–15k tokens, plus system instructions, plus dynamic context we deliberately pin into cacheable position. That's where the real bill collapse happens. On one production agentic system we ship, the average per-turn input cost dropped from $0.042 to $0.0048 after we restructured the prompt for cacheability — an 88% reduction sustained over weeks of production traffic.

The most useful internal metric we track is cache hit ratio per session: cached read tokens divided by (cached read + cache write + uncached input). Anything below 70% on a long-running agent means the prompt is structured wrong.

The four mistakes that silently kill your cache hit rate

This is where most teams lose. Caching looks like it's working — the API returns 200s, the responses look fine — but the cache_read_input_tokens field in the response is near zero and your bill keeps climbing. We've debugged this exact failure on enough projects that we now treat it as the default state until proven otherwise.

Mistake 1: Variable content above stable content. Caching matches prefixes, not arbitrary substrings. If you put today's date or the user's name at the top of the system prompt, every single request is a fresh prefix. Stable content (tool defs, system instructions) must come first; dynamic content goes at the end.

Mistake 2: Marking the wrong breakpoint. You don't cache "the message" — you cache "everything up to and including this content block." Putting cache_control on a frequently-changing block means you're writing a fresh cache every turn and paying the 1.25x write surcharge for nothing. Cache at the boundary between stable and dynamic, not inside the dynamic part.

Mistake 3: Falling below the minimum cacheable size. Anthropic requires 1024 tokens for Sonnet/Opus and 2048 for Haiku before a cache breakpoint is honored. Below that, the API silently ignores the directive. We've seen teams cache a 500-token system prompt and wonder why their hit rate is zero.

Mistake 4: Letting the 5-minute TTL expire between requests. If your traffic is bursty — a few requests then 10 minutes of silence — your cache evaporates and you pay the write surcharge again. For workloads with gappy traffic, the 1-hour TTL almost always pays for itself, despite the 2x write cost. Rate-limit back-pressure under concurrent load is one of the four reasons an AI agent that works in staging will degrade in production after a few days — when caching collapses, latency spikes, and degraded retrieval starts feeding the model worse context.

How we structure prompts for maximum cache hit rate

The pattern we use on every Claude-backed system looks like this, conceptually:

System block:
  1. Static identity / role instructions       [CACHE]
  2. Tool definitions (full JSON schemas)      [CACHE]
  3. Stable retrieved context / KB chunks      [CACHE breakpoint]
  4. Session-specific context (user profile)   [CACHE breakpoint]
  5. Current turn / dynamic input              (uncached)

We place cache breakpoints at the boundaries where content changes. Up to 4 breakpoints means you can have multiple cache layers — the outermost layer (identity + tools) almost never invalidates, the middle layer (knowledge base context) invalidates per project, the inner layer (session context) invalidates per user. Each layer benefits from cached reads even when an inner layer changes.

The decision we made early — and still defend — is that we'd rather pay a small latency cost to re-order the prompt for cacheability than save engineering time by structuring it the "natural" way. The economics aren't close. A 30-minute prompt restructure on a system doing 10k turns/day pays for itself in the first 12 hours.

When to use the 1-hour TTL vs the default 5-minute

We default to 5-minute TTL for chat-style and high-frequency agent workloads where requests come in continuously. The cache stays warm via traffic, and the cheaper write cost wins.

We switch to 1-hour TTL when traffic is bursty, batched, or low-frequency — for example, a back-office automation that runs every 15 minutes against the same RAG context, or an internal tool used a few times an hour. The 1-hour TTL costs 2x the standard write surcharge, but if it saves you even one re-write per hour, it's already cheaper than the alternative.

The math: 5-min TTL cache write = 1.25x input cost. 1-hour TTL cache write = 2x input cost. If you'd otherwise write the cache 4 times in an hour at 1.25x (total: 5x), one 1-hour write at 2x is a 60% saving on cache writes alone. We've watched this single TTL change cut weekly API spend by 40% on the right workload.

The monitoring pattern that catches cache regressions

Cache hit rate degrades silently. A teammate adds a timestamp to the system prompt, or reorders a tool definition, and your bill quietly doubles overnight. The only defense is monitoring.

The response object includes cache_creation_input_tokens and cache_read_input_tokens on every call. We log these to a metrics pipeline and alert on two conditions: cache hit ratio drops below a per-system threshold (usually 75%), or cache write tokens spike beyond a baseline. Either signal means someone broke the prefix.

This monitoring is non-negotiable on any system we run for clients. It's also one of the patterns we discuss in our writeup on context-aware KB cascade design — caching and context structure are the same problem viewed from two angles. The same discipline applies to output quality: a well-tuned cache is worthless if the model is silently regressing on critical responses, which is why we pair every caching deployment with an LLM eval suite that catches production regressions before they reach users.

When prompt caching is not the answer

Prompt caching isn't universally a win. Three cases where we skip it or use it sparingly:

One: low-volume workloads. If you're making 5 calls per hour and each prefix is 2k tokens, the write surcharge eats any saving. The cost crossover is roughly 3–4 cache hits within the TTL window — below that, vanilla input pricing wins.

Two: highly variable prompts. If genuinely every request is structurally different (e.g. one-shot document analysis where the document is the prompt), there's no stable prefix to cache. Better to focus optimization elsewhere — model selection, output length capping, parallel batching.

Three: very small prompts. The 1024/2048 token minimum kills caching for compact prompts. We frequently see teams try to cache a 600-token system prompt and assume the API is broken; it's just below threshold.

The pattern at a glance

Prompt caching turns Claude from "expensive per-token model" into "cheap after the first turn." The discipline is structural: put stable content first, mark breakpoints at change boundaries, hit the minimum token threshold, pick the right TTL for your traffic shape, and monitor hit ratio like you'd monitor uptime. Done correctly, an agentic system that costs $4,000/month at standard pricing runs at $400–600/month — and the latency improvement is a free bonus on top. This is the kind of optimization that compounds: every prompt-caching-aware system we ship pays for its own engineering cost within the first month of production traffic.

Frequently Asked Questions

How much does prompt caching actually save on a real Claude API bill?

Cached reads are billed at ~10% of standard input pricing, so the theoretical max is 90% off the cached portion. In production agentic workloads where 80-95% of the prompt is the same across turns, we typically see 85-90% sustained reduction in total input costs. On one production system we ship, average per-turn input cost dropped from $0.042 to $0.0048 — an 88% reduction held over weeks of traffic.

What is the minimum prompt size for Claude prompt caching to work?

Anthropic requires at least 1024 tokens before a cache breakpoint on Sonnet and Opus models, and 2048 tokens on Haiku. If the cached prefix is shorter than the threshold, the API silently ignores the cache_control directive and you get zero hits. Below that size the math also rarely justifies caching anyway.

Should I use the 5-minute or the 1-hour TTL for prompt caching?

Use the 5-minute TTL (the default) for chat-style or high-frequency workloads where traffic keeps the cache warm. Use the 1-hour TTL for bursty, batched, or low-frequency workloads — back-office automations, scheduled jobs, internal tools. The 1-hour TTL costs 2x the write surcharge instead of 1.25x, but if it avoids even one re-write per hour it's already cheaper.

Why is my Claude prompt cache hit rate near zero even though caching is enabled?

Four common causes: (1) variable content like a timestamp or username appears above stable content, breaking the prefix every request; (2) the cache_control breakpoint is placed inside a frequently-changing block instead of at the boundary; (3) the cached prefix is under the 1024/2048-token minimum and the directive is silently ignored; (4) traffic is bursty enough that the 5-minute TTL keeps expiring between requests.

How do I monitor Claude prompt cache performance in production?

Every Claude API response includes cache_creation_input_tokens and cache_read_input_tokens. Log these to a metrics pipeline and alert on two conditions: cache hit ratio (cached reads divided by total input tokens) falling below ~75%, and cache write tokens spiking above your baseline. Either signal almost always means a teammate added variable content above the stable prefix.

References

Prompt caching — Anthropic API documentation — Anthropic

To cite this article: Iron Mind AI. (2026). "Prompt Caching: How We Cut Claude API Costs by 90% in Production". Iron Mind AI Blog. https://iron-mind.ai/blog/prompt-caching-claude-production

Niro Knox

Full-stack engineer and AI systems builder with 30+ years of production experience. Specialises in LLM integrations, automation pipelines, and high-performance web applications.

Ready to Build Something?

Turn what you just read into a production system. We move fast.

Book a call