Prompt caching | Doubleword Inference API

Prompt caching lets you reuse a large, unchanging prefix of your prompt across many requests and pay a steeply reduced rate for the repeated tokens. Mark the end of the reusable content with a cache_control breakpoint; every later request that begins with the same content and carries the same marker is billed at ~10% of the normal input rate for that prefix.

It works on the OpenAI Chat Completions API (/v1/chat/completions) and the Anthropic Messages API (/v1/messages) — the newer OpenAI Responses API (/v1/responses) isn't supported yet. The marker only affects billing — the model's output is identical with or without it.

How it works

A breakpoint (cache_control) marks the end of a cacheable prefix: everything from the start of the prompt up to and including that block.

The first request with a given prefix writes it — a cache creation.
Later requests that begin with the same prefix read it — a cache read, billed at the discounted rate.

Both the write and the reads are driven by the marker — a prefix isn't cached once and then reused automatically. Every request that should hit the cache must carry the same cache_control marker; a request sent without markers reads nothing, even if an identical prefix is already cached.

Matching is prefix-based and left-anchored: a request reuses the cache up to the longest leading run of content that is byte-for-byte identical to something cached before. As soon as content diverges, everything after it is treated as new.

Request A (writes):  [ tools + system ]●[ user question 1 ]
Request B (reads):   [ tools + system ]●[ user question 2 ]
                      └── identical ──┘   └── different ──┘
                         cache READ          full price

(● = breakpoint.)
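
The left-anchored matching rule can be illustrated with a small Python sketch. This is not Doubleword's implementation: the function name `cached_prefix_len` and the representation of a prompt as a list of block strings are assumptions made purely for illustration.

```python
def cached_prefix_len(cached: list[str], request: list[str]) -> int:
    """Return how many leading blocks of `request` match `cached` exactly."""
    matched = 0
    for old, new in zip(cached, request):
        if old != new:  # the first divergence ends the reusable prefix
            break
        matched += 1
    return matched


request_a = ["<tools>", "<system>", "user question 1"]
request_b = ["<tools>", "<system>", "user question 2"]
reordered = ["<system>", "<tools>", "user question 1"]

print(cached_prefix_len(request_a, request_b))  # 2: tools + system are reused
print(cached_prefix_len(request_a, reordered))  # 0: same content, different order
```

The second call shows why ordering matters: identical blocks in a different order share no leading run, so nothing is reused.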

TTL — each breakpoint lasts "5m" (the default) or "1h". The TTL is a sliding window: every read refreshes the expiry, so an actively-reused prefix stays warm and only expires after a full TTL with no reuse. Use 5m for bursty, back-to-back calls; 1h for interactive sessions.

Privacy & sharing — a cached prefix is scoped to the API key's owner and is never shared across customers. This gives you a choice of scope: use a key from your personal account to keep caching to yourself, or use one of your organization's keys to share prefix caching across your whole org — any org key can read a prefix that another org key wrote.
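
The sliding TTL and the owner scoping described above can be modelled with a toy in-memory cache. This is a sketch for building intuition only; the class name `SlidingTtlCache`, the scope strings, and the explicit `now` argument (seconds, passed in so the example is deterministic) are all hypothetical and say nothing about how the service stores entries.

```python
TTL_SECONDS = {"5m": 300, "1h": 3600}


class SlidingTtlCache:
    """Toy model: entries are keyed by (owner scope, prefix) and expire after idle time."""

    def __init__(self) -> None:
        # (scope, prefix) -> (expiry time, ttl length in seconds)
        self._entries: dict[tuple[str, str], tuple[float, int]] = {}

    def write(self, scope: str, prefix: str, ttl: str, now: float) -> None:
        seconds = TTL_SECONDS[ttl]
        self._entries[(scope, prefix)] = (now + seconds, seconds)

    def read(self, scope: str, prefix: str, now: float) -> bool:
        entry = self._entries.get((scope, prefix))
        if entry is None or now >= entry[0]:
            self._entries.pop((scope, prefix), None)  # missing or expired
            return False
        # A hit slides the window: expiry restarts from the time of this read.
        self._entries[(scope, prefix)] = (now + entry[1], entry[1])
        return True


cache = SlidingTtlCache()
cache.write("org-acme", "<tools + system>", ttl="5m", now=0)
print(cache.read("org-acme", "<tools + system>", now=200))        # True, expiry moves to 500
print(cache.read("org-acme", "<tools + system>", now=450))        # True, expiry moves to 750
print(cache.read("other-customer", "<tools + system>", now=460))  # False, different scope
print(cache.read("org-acme", "<tools + system>", now=800))        # False, idle for a full TTL
```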

Quick start

Put the reusable content in an array-style content block and add the marker. Send the request twice (the user message may differ between calls); the second call reads the prefix from cache.

POST https://api.doubleword.ai/v1/chat/completions
{
  "model": "{{selectedModel.id}}",
  "messages": [
    { "role": "system", "content": [
      { "type": "text",
        "text": "<~2,000 tokens of stable instructions>",
        "cache_control": { "type": "ephemeral", "ttl": "1h" } } ] },
    { "role": "user", "content": "How do I reset my password?" }
  ]
}

POST https://api.doubleword.ai/v1/messages
{
  "model": "{{selectedModel.id}}",
  "max_tokens": 256,
  "system": [
    { "type": "text",
      "text": "<~2,000 tokens of stable instructions>",
      "cache_control": { "type": "ephemeral", "ttl": "1h" } } ],
  "messages": [ { "role": "user", "content": "How do I reset my password?" } ]
}

Read the result back from usage:

// Chat Completions (/v1/chat/completions)
// 1st call: writes the prefix
"usage": { "prompt_tokens": 2088, "cache_creation_input_tokens": 2048, "cache_read_input_tokens": 0 }

// 2nd call: reads it; only the new user turn is billed at full price
"usage": { "prompt_tokens": 2096, "cache_read_input_tokens": 2048, "cache_creation_input_tokens": 0,
           "prompt_tokens_details": { "cached_tokens": 2048 } }

// Messages API (/v1/messages)
// 1st call: writes the prefix
"usage": { "input_tokens": 2088, "cache_creation_input_tokens": 2048, "cache_read_input_tokens": 0 }

// 2nd call: reads it; input_tokens counts only the new turn (cached tokens are reported separately)
"usage": { "input_tokens": 48, "cache_read_input_tokens": 2048, "cache_creation_input_tokens": 0 }

Streaming works the same way — the cache fields arrive in the final usage chunk.
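
To see what a usage block means for the bill, one option is to convert it into "standard-rate input token equivalents" using the multipliers from the Pricing section. The sketch below handles the Chat Completions shape only. The function name `billed_input_equivalent` is hypothetical, and it assumes that the uncached remainder is prompt_tokens minus the read and written counts, and that the caller knows which TTL tier was written.

```python
READ_RATE = 0.1
WRITE_RATE = {"5m": 1.25, "1h": 2.0}


def billed_input_equivalent(usage: dict, ttl: str = "5m") -> float:
    """Input cost of one Chat Completions call, in standard-rate input tokens."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    # Assumption: prompt_tokens is the total, so the rest is plain uncached input.
    uncached = usage["prompt_tokens"] - read - written
    return uncached + read * READ_RATE + written * WRITE_RATE[ttl]


first = {"prompt_tokens": 2088, "cache_creation_input_tokens": 2048,
         "cache_read_input_tokens": 0}
second = {"prompt_tokens": 2096, "cache_read_input_tokens": 2048,
          "cache_creation_input_tokens": 0}

print(billed_input_equivalent(first, ttl="1h"))   # 40 uncached + 2048 written at 2.0
print(billed_input_equivalent(second, ttl="1h"))  # 48 uncached + 2048 read at 0.1
```

Under these assumptions the first call costs more than an uncached request would, and the second costs roughly an eighth of one.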

Multiple breakpoints

You can place up to 4 breakpoints in one request, and using more than one pays off when your prompt has segments that change at different frequencies.

Take a common case: your tools never change, but you have a small set of system prompts — say one per task or agent mode — and pick one for each request. With a single breakpoint at the end of the system prompt, each system prompt caches tools + system together as one entry — so the identical tools get cached again under every variant, and you pay to write the tools once for each system prompt.

Put two breakpoints instead — one after the tools, one after the system prompt — and each segment is cached on its own cadence: the tools are written once and read by every system-prompt variant, and each system prompt is cached on top of them.

[ tools ]①            ← cached once; read by every system-prompt variant
[ system prompt ]②    ← cached per distinct system prompt
[ user message ]

The rule: put a breakpoint at the end of each segment that changes on its own cadence — most-stable first (tools), then less-stable (system prompt, retrieved documents), then the conversation. Each layer is written once and reused until it changes, so you never re-pay to cache the stable parts underneath.
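
A rough calculation shows how much the second breakpoint saves. The sketch below compares the cost of the first request for each system-prompt variant, in standard-rate input tokens, using the 5m write and read multipliers from the Pricing section. The function name and the token counts are hypothetical, and the model ignores later reads, the user message, and expiry.

```python
READ_RATE, WRITE_RATE_5M = 0.1, 1.25


def first_request_cost(tools: int, system: int, variants: int, layered: bool) -> float:
    """Cost of the first request per system-prompt variant, in standard input tokens."""
    if not layered:
        # One breakpoint: every variant writes tools + system as a single entry.
        return variants * (tools + system) * WRITE_RATE_5M
    # Two breakpoints: tools are written once, then read by the other variants.
    tools_cost = tools * WRITE_RATE_5M + (variants - 1) * tools * READ_RATE
    return tools_cost + variants * system * WRITE_RATE_5M


# Hypothetical sizes: 3,000 tokens of tools, 500-token system prompts, 4 variants.
print(first_request_cost(3000, 500, 4, layered=False))  # 4 * 3500 * 1.25
print(first_request_cost(3000, 500, 4, layered=True))   # 3750 + 900 + 2500
```

With these illustrative numbers the layered layout costs well under half as much to warm up, and the gap widens as the tools grow or more variants are added.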

Patterns

System prompt & tools

Tool definitions and a long system prompt are identical on every call — cache them as the stable prefix and vary only the user turn. Tools sit first in the prefix, so a cache_control on the last tool caches every tool before it; a marker on the system block caches the tools and the system prompt.

POST https://api.doubleword.ai/v1/chat/completions
{
  "model": "{{selectedModel.id}}",
  "tools": [
    { "type": "function", "function": { "name": "search", "parameters": {} } },
    { "type": "function", "function": { "name": "book_flight", "parameters": {} },
      "cache_control": { "type": "ephemeral", "ttl": "1h" } }
  ],
  "messages": [
    { "role": "system", "content": [
      { "type": "text", "text": "<agent instructions>",
        "cache_control": { "type": "ephemeral", "ttl": "1h" } } ] },
    { "role": "user", "content": "Book me a flight to Tokyo next Friday." }
  ]
}

POST https://api.doubleword.ai/v1/messages
{
  "model": "{{selectedModel.id}}",
  "max_tokens": 512,
  "tools": [
    { "name": "search",      "input_schema": { "type": "object" } },
    { "name": "book_flight", "input_schema": { "type": "object" },
      "cache_control": { "type": "ephemeral", "ttl": "1h" } }
  ],
  "system": [ { "type": "text", "text": "<agent instructions>",
               "cache_control": { "type": "ephemeral", "ttl": "1h" } } ],
  "messages": [ { "role": "user", "content": "Book me a flight to Tokyo next Friday." } ]
}

RAG / document Q&A

Cache a large retrieved document once, then ask many questions about it — each question only pays full price for itself.

POST https://api.doubleword.ai/v1/chat/completions
{
  "model": "{{selectedModel.id}}",
  "messages": [
    { "role": "system", "content": [ { "type": "text", "text": "Answer only from the document." } ] },
    { "role": "user", "content": [
      { "type": "text", "text": "<the 50-page contract>",
        "cache_control": { "type": "ephemeral", "ttl": "1h" } },
      { "type": "text", "text": "What is the termination notice period?" }
    ] }
  ]
}

The next question about the same document reads the whole contract from cache; only the new question is billed at full rate.
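
One way to keep the document block identical across questions is to build every request through a single helper. This is a minimal sketch of the Chat Completions body shown above; the function name `build_doc_qa_request` and the model id `example-model` are placeholders, not part of the API.

```python
import json


def build_doc_qa_request(model: str, document: str, question: str) -> dict:
    """Chat Completions body: the document block is marked, the question is not."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": [
                {"type": "text", "text": "Answer only from the document."}]},
            {"role": "user", "content": [
                # Stable, marked block: must be identical on every request.
                {"type": "text", "text": document,
                 "cache_control": {"type": "ephemeral", "ttl": "1h"}},
                # Volatile block goes after the breakpoint.
                {"type": "text", "text": question},
            ]},
        ],
    }


contract = "<the 50-page contract>"
first = build_doc_qa_request("example-model", contract,
                             "What is the termination notice period?")
second = build_doc_qa_request("example-model", contract, "Who are the parties?")
print(json.dumps(second, indent=2))
```

Because the system message and the document block come out the same for both calls, only the trailing question block differs between requests.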

Growing conversation

Multi-turn chat is layered caching with a moving target: the system prompt and tools stay fixed, any retrieved docs are per-session, and the conversation grows a message at a time. Put a breakpoint on each fixed layer and, because caching isn't automatic, put a cache_control marker on the latest message in every request. Each turn then reads the whole conversation so far from cache at the discounted rate. Only the tokens added since your last request (the previous reply plus your new message) are written at the write rate, ready to be read cheaply on the next turn. Earlier turns are never re-written, so a long conversation stays cheap: a discounted read of the history plus a small write for the new tail.

Turn 1:  [sys+tools]①[docs]②[user₁ ●]                  → mark user₁; writes through user₁
Turn 2:  [sys+tools]①[docs]②[user₁][asst₁][user₂ ●]    → mark user₂; reads through asst₁, writes the new turn
Turn 3:  [sys+tools]①[docs]②…[asst₂][user₃ ●]          → mark user₃; reads through asst₂, writes the new turn

Because matching is longest-common-prefix, a follow-up that shares only breakpoint ① still reads that layer and re-caches from there — you don't have to plan reuse perfectly.
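
The moving marker can be handled by rebuilding the message list each turn. The sketch below uses the Chat Completions message shape; `build_turn` is a hypothetical helper, and representing the conversation as (role, text) pairs is an assumption made to keep the example short.

```python
def build_turn(fixed: list[dict], turns: list[tuple[str, str]],
               ttl: str = "1h") -> list[dict]:
    """Messages for one request: fixed layers, then the conversation, marker on the last turn."""
    if not turns:
        raise ValueError("need at least one conversation message")
    messages = list(fixed)  # fixed layers keep their own breakpoints
    for role, text in turns:
        messages.append({"role": role, "content": [{"type": "text", "text": text}]})
    # The moving breakpoint goes on the newest message only.
    messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral", "ttl": ttl}
    return messages


fixed = [{"role": "system", "content": [
    {"type": "text", "text": "<agent instructions>",
     "cache_control": {"type": "ephemeral", "ttl": "1h"}}]}]

turn1 = build_turn(fixed, [("user", "Hi")])
turn2 = build_turn(fixed, [("user", "Hi"), ("assistant", "Hello!"),
                           ("user", "Reset my password")])
print(sum("cache_control" in b for m in turn2 for b in m["content"]))  # 2 markers
```

Each call emits the earlier turns with the same text and puts the marker on the final message only, which mirrors the Turn 1 to Turn 3 diagram. The moving marker counts toward the limit of 4 breakpoints per request.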

Pricing

Operation	Rate vs. standard input
Cache read (5m or 1h)	0.1× (90% off)
Cache write — 5m	1.25×
Cache write — 1h	2×
Uncached input / output	standard

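Because a write costs a premium and every read is cheap, caching pays off after very few reuses. At the multipliers above, a 5m prefix is already cheaper than no caching on its first read (1.25 + 0.1 = 1.35× versus 2× for two uncached requests), and a 1h prefix on its second read (2 + 0.2 = 2.2× versus 3×). Exact per-token prices are on each model's pricing page.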
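
The break-even point follows directly from the multipliers in the table. As a small sketch (the function name is hypothetical, and it assumes every read lands before the prefix expires), the number of reads needed before caching beats sending the prefix uncached each time is:

```python
READ_RATE = 0.1
WRITE_RATE = {"5m": 1.25, "1h": 2.0}


def reads_to_break_even(ttl: str) -> int:
    """Smallest number of cache reads after which caching beats sending uncached."""
    reads = 0
    # Cached: one write plus `reads` discounted reads.
    # Uncached: the same 1 + reads requests at 1.0 each.
    while WRITE_RATE[ttl] + reads * READ_RATE >= 1 + reads:
        reads += 1
    return reads


print(reads_to_break_even("5m"))  # 1
print(reads_to_break_even("1h"))  # 2
```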

What you can cache

cache_control works on content blocks throughout the request, in the tools → system → messages order:

Tool definitions — objects in the tools array.
System content blocks.
User and assistant text content blocks.
Tool results — a role: "tool" message (Chat Completions) or a tool_result block (Messages API).

A few things worth knowing:

Images & documents can sit inside a cached prefix without breaking it (changing one invalidates everything after it), but the image/document tokens themselves aren't discounted; only the text tokens in the prefix are.
Empty text blocks and sub-content blocks (like citations) can't be marked directly — cache the top-level block instead.
Caching is opt-in and best-effort: the marker requests caching, and you confirm it landed by reading the usage fields (above). On a model that doesn't support caching, the marker is simply ignored and you're billed at standard rates — so it's always safe to include.

Getting the best out of caching

Golden rule: stable content first, volatile content last, breakpoint at the boundary.

[ STABLE PREFIX: tools, system prompt, instructions, examples, documents ] ● breakpoint
[ VOLATILE SUFFIX: this request's actual question ]

Order matters. Everything before the breakpoint must be identical across requests to get a hit. Keep per-request content — the question, timestamps, IDs — after the cached blocks, and send the cached blocks in a consistent order (the same content in a different order is a cache miss).
Keep prefixes byte-identical. A single changed character before a breakpoint — a different date, a reordered sentence — breaks the match for everything after it.
Cache the big stuff. Prefixes shorter than ~1,024 tokens aren't cached, so there's no benefit to marking small blocks.
Don't cache one-offs. A write costs more than uncached input (1.25× for 5m, 2× for 1h), so only cache prefixes you'll reuse.
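
Accidental drift in a prefix is easy to miss, so it can help to fingerprint the cacheable blocks on the client and log the value with each request. The sketch below is a client-side check only; the helper name is hypothetical and the hash has nothing to do with how the service keys its cache.

```python
import hashlib
import json


def prefix_fingerprint(blocks: list[dict]) -> str:
    """Hash the cacheable blocks in order; any changed byte or reordering changes the hash."""
    canonical = json.dumps(blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


stable = [{"type": "text", "text": "You are a support agent."},
          {"type": "text", "text": "Policy: refunds within 30 days."}]
drifted = [{"type": "text", "text": "You are a support agent. Today is 2024-05-01."},
           stable[1]]

print(prefix_fingerprint(stable) == prefix_fingerprint(list(stable)))  # True: unchanged
print(prefix_fingerprint(stable) != prefix_fingerprint(drifted))       # True: a date crept in
print(prefix_fingerprint(stable) != prefix_fingerprint(stable[::-1]))  # True: order changed
```

If the fingerprint changes between requests that were meant to share a prefix, expect a cache miss from the first changed block onward.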

Reference

Marker — add to any content block, or to a tool object in the tools array:

"cache_control": { "type": "ephemeral", "ttl": "5m" }

type must be "ephemeral"; ttl is "5m" (default) or "1h".
Up to 4 breakpoints per request.
Prefixes under ~1,024 tokens are not cached.
Markers change the bill, never the output.
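
The marker rules above can be checked before a request is sent. This is a minimal sketch with hypothetical helper names; it walks any request body (either API shape) and treats every `cache_control` key it finds as a breakpoint, which assumes no tool schema uses that key name for something else.

```python
VALID_TTLS = {"5m", "1h"}
MAX_BREAKPOINTS = 4


def find_markers(node):
    """Yield every cache_control value found anywhere in a request body."""
    if isinstance(node, dict):
        if "cache_control" in node:
            yield node["cache_control"]
        for value in node.values():
            yield from find_markers(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_markers(item)


def check_markers(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the markers look valid."""
    markers = list(find_markers(body))
    problems = []
    if len(markers) > MAX_BREAKPOINTS:
        problems.append(f"{len(markers)} breakpoints, limit is {MAX_BREAKPOINTS}")
    for marker in markers:
        if marker.get("type") != "ephemeral":
            problems.append(f"unsupported type: {marker.get('type')!r}")
        if marker.get("ttl", "5m") not in VALID_TTLS:  # ttl defaults to 5m
            problems.append(f"unsupported ttl: {marker.get('ttl')!r}")
    return problems


body = {"system": [{"type": "text", "text": "<agent instructions>",
                    "cache_control": {"type": "ephemeral", "ttl": "10m"}}]}
print(check_markers(body))  # ["unsupported ttl: '10m'"]
```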

Usage fields — the totals field differs by API; the cache fields are the same on both:

Chat Completions (/v1/chat/completions):
prompt_tokens                                   total input tokens (includes the cached ones)
cache_read_input_tokens                         tokens served from cache — discounted read rate
prompt_tokens_details.cached_tokens             same count as cache_read_input_tokens (OpenAI-standard field)
cache_creation_input_tokens                     tokens written to the cache this request — write rate
cache_creation.ephemeral_{5m,1h}_input_tokens   write breakdown per TTL tier

Messages API (/v1/messages):
input_tokens                                    input tokens NOT served from cache (cached counted separately)
cache_read_input_tokens                         tokens served from cache — discounted read rate
cache_creation_input_tokens                     tokens written to the cache this request — write rate
cache_creation.ephemeral_{5m,1h}_input_tokens   write breakdown per TTL tier

Note

Cache token counts in usage are derived from the content you send and are used for billing; they can differ slightly from a model's internal tokenisation (tool definitions in particular are tokenised from their JSON). The discount is always applied consistently to the counts shown.

Further information

Doubleword's prompt caching follows the same model and semantics as Anthropic's, so their guidance carries over directly. For deeper background — cache-aware prompt structure, what invalidates a cache, and more worked examples — see Anthropic's prompt caching guide. Note two Doubleword-specific details: the available TTLs are 5m and 1h, and per-token pricing is per model (above).