Overview
Solid cross-provider tests and ablations support consistent cost savings; TTFT effects depend on provider behavior so test in your environment before deploying.
Citations0
Evidence Strength0.85
Confidence0.80
Risk Signals10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 90%
Production readiness: 80%
Novelty: 40%
Why It Matters For Business
Prompt caching can cut multi-turn agent API bills by roughly half or more and often speed initial response; apply it to reduce cloud costs and improve user-perceived latency.
Who Should Care
Summary TLDR
This paper measures provider prompt caching (reusing KV tensors) for long, tool-heavy agent sessions. Across OpenAI, Anthropic, and Google flagship models on DeepResearch Bench, caching reduced API cost by 41–80% and improved time to first token (TTFT) by 13–31% when using targeted cache strategies. Key practical advice: cache only stable prefixes (e.g., the system prompt), place dynamic values after the cacheable prefix, and avoid naive full-context caching which can increase latency. Benefits scale with prompt size and hold across 3–50 tool calls; latency gains depend on provider behavior and minimum caching thresholds.
Problem Statement
Agentic LLM sessions grow large and expensive as agents call tools repeatedly. Providers offer prompt caching (reusing attention key-value tensors) but we lack systematic, cross-provider measurements and guidance for long-horizon, tool-heavy agent workloads.
Main Contribution
First cross-provider, empirical evaluation of provider prompt caching on multi-turn agentic sessions.
Quantified cost and latency effects across four flagship models and three providers on DeepResearch Bench.
Key Findings
Prompt caching reliably reduces API cost.
Latency gains vary by provider and strategy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Cost reduction (best cache mode) | 41%–80% lower cost vs no-cache | No-cache baseline | −41% to −80% | DeepResearch Bench, 500 sessions | Table 1 shows best-mode cost ↓: GPT-5.2 79.6%, Claude 78.5%, Gemini 41.4%, GPT-4o 45.9% | Table 1 |
| Time-to-first-token (TTFT) improvement (best cache mode) | 13%–31% faster | No-cache baseline | −13% to −31% | DeepResearch Bench | Table 1: TTFT ↓: GPT-5.2 13.0%, Claude 22.9%, Gemini 6.1%, GPT-4o 30.9% | Table 1 |
What To Try In 7 Days
Measure current session token breakdown (system prompt vs dynamic content).
Enable provider prompt caching and run representative sessions to measure cost and TTFT.
Move dynamic values (timestamps, IDs, tool results) after a UUID cache breaker to preserve a reusable system-prompt prefix.
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Experiments run on DeepResearch Bench; results may differ for other workloads.
Provider pricing, TTLs, and caching behavior reflect early-2026 documentation and can change.
When Not To Use
Short prompts below provider minimum tokens (no caching benefit).
Sessions where tool outputs are unique and never repeat across requests.
Failure Modes
Full-context caching writes dynamic content, creating overhead that can increase latency.
Including volatile tokens in the system prompt breaks cache reuse across sessions.

