Prompt caching cuts agent API costs 41–80% and speeds time-to-first-token 13–31%

January 9, 20268 min

Overview

Production Readiness

0.8

Novelty Score

0.4

Cost Impact Score

0.9

Citation Count

0

Authors

Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah

Links

Abstract / PDF

Why It Matters For Business

Prompt caching can cut multi-turn agent API bills by roughly half or more and often speed initial response; apply it to reduce cloud costs and improve user-perceived latency.

Summary TLDR

This paper measures provider prompt caching (reusing KV tensors) for long, tool-heavy agent sessions. Across OpenAI, Anthropic, and Google flagship models on DeepResearch Bench, caching reduced API cost by 41–80% and improved time to first token (TTFT) by 13–31% when using targeted cache strategies. Key practical advice: cache only stable prefixes (e.g., the system prompt), place dynamic values after the cacheable prefix, and avoid naive full-context caching which can increase latency. Benefits scale with prompt size and hold across 3–50 tool calls; latency gains depend on provider behavior and minimum caching thresholds.

Problem Statement

Agentic LLM sessions grow large and expensive as agents call tools repeatedly. Providers offer prompt caching (reusing attention key-value tensors) but we lack systematic, cross-provider measurements and guidance for long-horizon, tool-heavy agent workloads.

Main Contribution

First cross-provider, empirical evaluation of provider prompt caching on multi-turn agentic sessions.

Quantified cost and latency effects across four flagship models and three providers on DeepResearch Bench.

Compared four cache modes and recommended practical cache boundary strategies.

Ablation study across prompt sizes (500–50,000 tokens) and tool counts (3–50) with concrete takeaways.

Key Findings

Prompt caching reliably reduces API cost.

NumbersCost reduced 41%–80% vs no-cache (Table 1)

Latency gains vary by provider and strategy.

NumbersTTFT improved 13%–31% (best cache modes across models)

Naive full-context caching can hurt latency if it caches dynamic content.

NumbersFull-context caused TTFT regression up to −8.8% for GPT-4o (Table 2)

Caching benefits scale with prompt size more than tool count.

NumbersCost savings scale linearly from ~10% (500 tokens) to 54%–89% (50k tokens) (Figure 4)

Results

Cost reduction (best cache mode)

Value41%–80% lower cost vs no-cache

BaselineNo-cache baseline

Time-to-first-token (TTFT) improvement (best cache mode)

Value13%–31% faster

BaselineNo-cache baseline

Prompt size scaling (cost)

ValueCost savings grow with prompt size

BaselineSmall prompts (500 tokens)

Minimum token thresholds

ValueCache inactive below provider thresholds

BaselinePrompts below thresholds

Who Should Care

What To Try In 7 Days

Measure current session token breakdown (system prompt vs dynamic content).

Enable provider prompt caching and run representative sessions to measure cost and TTFT.

Move dynamic values (timestamps, IDs, tool results) after a UUID cache breaker to preserve a reusable system-prompt prefix.

Agent Features

Memory

  • prompt cache (KV tensor reuse)
  • short-term context window

Planning

  • multi-turn tool calling
  • iterative web research

Tool Use

  • web search tool
  • function calling APIs

Frameworks

  • LangChain / Deep Agents
  • Provider-managed prompt caches (OpenAI/Anthropic/Google)

Is Agentic

true

Architectures

  • LLM agents with function calling

Optimization Features

Token Efficiency

  • cached input tokens billed at lower rates
  • system prompt caching yields largest token savings

Infra Optimization

  • consider provider minimum token thresholds and TTLs
  • account for cache write/read pricing differences

System Optimization

  • place dynamic values after cacheable prefix
  • avoid embedding volatile tool definitions in system prompt

Inference Optimization

  • prompt caching (reuse KV tensors)
  • cache boundary control with UUIDs
  • exclude dynamic tool results from cache

Reproducibility

Data Urls

  • DeepResearch Bench (Du et al., 2025)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments run on DeepResearch Bench; results may differ for other workloads.
  • Provider pricing, TTLs, and caching behavior reflect early-2026 documentation and can change.
  • TTFT measurements show variance from network and server load; latency results are provider-dependent.
  • Paper did not measure downstream task correctness or user-facing answer quality tied to caching.

When Not To Use

  • Short prompts below provider minimum tokens (no caching benefit).
  • Sessions where tool outputs are unique and never repeat across requests.
  • When timing side-channel risk is unacceptable in multi-tenant settings.

Failure Modes

  • Full-context caching writes dynamic content, creating overhead that can increase latency.
  • Including volatile tokens in the system prompt breaks cache reuse across sessions.
  • Provider TTLs or cache eviction patterns may cause inconsistent cache hits.

Core Entities

Models

  • OpenAI GPT-5.2
  • OpenAI GPT-4o
  • Anthropic Claude Sonnet 4.5
  • Google Gemini 2.5 Pro

Metrics

  • API cost (USD per session)
  • Time to first token (TTFT)

Datasets

  • DeepResearch Bench

Benchmarks

  • DeepResearch Bench

Context Entities

Metrics

  • cached input tokens
  • cache write tokens
  • cache read tokens

Datasets

  • DeepResearch Bench (100 research tasks)