Overview
Production Readiness
0.8
Novelty Score
0.4
Cost Impact Score
0.9
Citation Count
0
Why It Matters For Business
Prompt caching can cut multi-turn agent API bills by roughly half or more and often speed initial response; apply it to reduce cloud costs and improve user-perceived latency.
Summary TLDR
This paper measures provider prompt caching (reusing KV tensors) for long, tool-heavy agent sessions. Across OpenAI, Anthropic, and Google flagship models on DeepResearch Bench, caching reduced API cost by 41–80% and improved time to first token (TTFT) by 13–31% when using targeted cache strategies. Key practical advice: cache only stable prefixes (e.g., the system prompt), place dynamic values after the cacheable prefix, and avoid naive full-context caching which can increase latency. Benefits scale with prompt size and hold across 3–50 tool calls; latency gains depend on provider behavior and minimum caching thresholds.
Problem Statement
Agentic LLM sessions grow large and expensive as agents call tools repeatedly. Providers offer prompt caching (reusing attention key-value tensors) but we lack systematic, cross-provider measurements and guidance for long-horizon, tool-heavy agent workloads.
Main Contribution
First cross-provider, empirical evaluation of provider prompt caching on multi-turn agentic sessions.
Quantified cost and latency effects across four flagship models and three providers on DeepResearch Bench.
Compared four cache modes and recommended practical cache boundary strategies.
Ablation study across prompt sizes (500–50,000 tokens) and tool counts (3–50) with concrete takeaways.
Key Findings
Prompt caching reliably reduces API cost.
Latency gains vary by provider and strategy.
Naive full-context caching can hurt latency if it caches dynamic content.
Caching benefits scale with prompt size more than tool count.
Results
Cost reduction (best cache mode)
Time-to-first-token (TTFT) improvement (best cache mode)
Prompt size scaling (cost)
Minimum token thresholds
Who Should Care
What To Try In 7 Days
Measure current session token breakdown (system prompt vs dynamic content).
Enable provider prompt caching and run representative sessions to measure cost and TTFT.
Move dynamic values (timestamps, IDs, tool results) after a UUID cache breaker to preserve a reusable system-prompt prefix.
Agent Features
Memory
- prompt cache (KV tensor reuse)
- short-term context window
Planning
- multi-turn tool calling
- iterative web research
Tool Use
- web search tool
- function calling APIs
Frameworks
- LangChain / Deep Agents
- Provider-managed prompt caches (OpenAI/Anthropic/Google)
Is Agentic
true
Architectures
- LLM agents with function calling
Optimization Features
Token Efficiency
- cached input tokens billed at lower rates
- system prompt caching yields largest token savings
Infra Optimization
- consider provider minimum token thresholds and TTLs
- account for cache write/read pricing differences
System Optimization
- place dynamic values after cacheable prefix
- avoid embedding volatile tool definitions in system prompt
Inference Optimization
- prompt caching (reuse KV tensors)
- cache boundary control with UUIDs
- exclude dynamic tool results from cache
Reproducibility
Data Urls
- DeepResearch Bench (Du et al., 2025)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments run on DeepResearch Bench; results may differ for other workloads.
- Provider pricing, TTLs, and caching behavior reflect early-2026 documentation and can change.
- TTFT measurements show variance from network and server load; latency results are provider-dependent.
- Paper did not measure downstream task correctness or user-facing answer quality tied to caching.
When Not To Use
- Short prompts below provider minimum tokens (no caching benefit).
- Sessions where tool outputs are unique and never repeat across requests.
- When timing side-channel risk is unacceptable in multi-tenant settings.
Failure Modes
- Full-context caching writes dynamic content, creating overhead that can increase latency.
- Including volatile tokens in the system prompt breaks cache reuse across sessions.
- Provider TTLs or cache eviction patterns may cause inconsistent cache hits.
Core Entities
Models
- OpenAI GPT-5.2
- OpenAI GPT-4o
- Anthropic Claude Sonnet 4.5
- Google Gemini 2.5 Pro
Metrics
- API cost (USD per session)
- Time to first token (TTFT)
Datasets
- DeepResearch Bench
Benchmarks
- DeepResearch Bench
Context Entities
Metrics
- cached input tokens
- cache write tokens
- cache read tokens
Datasets
- DeepResearch Bench (100 research tasks)

