Prompt caching cuts agent API costs 41–80% and speeds time-to-first-token 13–31%

Overview

Decision SnapshotReady For Pilot

Solid cross-provider tests and ablations support consistent cost savings; TTFT effects depend on provider behavior so test in your environment before deploying.

Citations0

Evidence Strength0.85

Confidence0.80

Risk Signals10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 80%

Novelty: 40%

Authors

Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah

Links

Abstract / PDF / Data

Why It Matters For Business

Prompt caching can cut multi-turn agent API bills by roughly half or more and often speed initial response; apply it to reduce cloud costs and improve user-perceived latency.

Who Should Care

CTO Engineering Lead ML Engineer Product Manager Founder

Summary TLDR

This paper measures provider prompt caching (reusing KV tensors) for long, tool-heavy agent sessions. Across OpenAI, Anthropic, and Google flagship models on DeepResearch Bench, caching reduced API cost by 41–80% and improved time to first token (TTFT) by 13–31% when using targeted cache strategies. Key practical advice: cache only stable prefixes (e.g., the system prompt), place dynamic values after the cacheable prefix, and avoid naive full-context caching which can increase latency. Benefits scale with prompt size and hold across 3–50 tool calls; latency gains depend on provider behavior and minimum caching thresholds.

Problem Statement

Agentic LLM sessions grow large and expensive as agents call tools repeatedly. Providers offer prompt caching (reusing attention key-value tensors) but we lack systematic, cross-provider measurements and guidance for long-horizon, tool-heavy agent workloads.

Main Contribution

First cross-provider, empirical evaluation of provider prompt caching on multi-turn agentic sessions.

Quantified cost and latency effects across four flagship models and three providers on DeepResearch Bench.

Key Findings

Prompt caching reliably reduces API cost.

NumbersCost reduced 41%–80% vs no-cache (Table 1)

Practical UseEnable provider prompt caching and ensure a stable cacheable prefix (system prompt) to cut token bills substantially.

Evidence RefTable 1, Table 2

Latency gains vary by provider and strategy.

NumbersTTFT improved 13%–31% (best cache modes across models)

Practical UseMeasure TTFT per provider; pick cache mode for latency (system prompt only or exclude tool results) rather than assuming full-context caching is best.

Evidence RefFigure 1, Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Cost reduction (best cache mode)	41%–80% lower cost vs no-cache	No-cache baseline	−41% to −80%	DeepResearch Bench, 500 sessions	Table 1 shows best-mode cost ↓: GPT-5.2 79.6%, Claude 78.5%, Gemini 41.4%, GPT-4o 45.9%	Table 1
Time-to-first-token (TTFT) improvement (best cache mode)	13%–31% faster	No-cache baseline	−13% to −31%	DeepResearch Bench	Table 1: TTFT ↓: GPT-5.2 13.0%, Claude 22.9%, Gemini 6.1%, GPT-4o 30.9%	Table 1

What To Try In 7 Days

Measure current session token breakdown (system prompt vs dynamic content).

Enable provider prompt caching and run representative sessions to measure cost and TTFT.

Move dynamic values (timestamps, IDs, tool results) after a UUID cache breaker to preserve a reusable system-prompt prefix.

Agent Features

Memory

prompt cache (KV tensor reuse)short-term context window

Planning

multi-turn tool callingiterative web research

Tool Use

web search toolfunction calling APIs

Frameworks

LangChain / Deep AgentsProvider-managed prompt caches (OpenAI/Anthropic/Google)

Is Agentic

Yes

Architectures

LLM agents with function calling

Optimization Features

Token Efficiency

cached input tokens billed at lower ratessystem prompt caching yields largest token savings

Infra Optimization

consider provider minimum token thresholds and TTLsaccount for cache write/read pricing differences

System Optimization

place dynamic values after cacheable prefixavoid embedding volatile tool definitions in system prompt

Inference Optimization

prompt caching (reuse KV tensors)cache boundary control with UUIDsexclude dynamic tool results from cache

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Data URLs

DeepResearch Bench (Du et al., 2025)

Risks & Boundaries

Limitations

Experiments run on DeepResearch Bench; results may differ for other workloads.

Provider pricing, TTLs, and caching behavior reflect early-2026 documentation and can change.

When Not To Use

Short prompts below provider minimum tokens (no caching benefit).

Sessions where tool outputs are unique and never repeat across requests.

Failure Modes

Full-context caching writes dynamic content, creating overhead that can increase latency.

Including volatile tokens in the system prompt breaks cache reuse across sessions.

Core Entities

Models

OpenAI GPT-5.2OpenAI GPT-4oAnthropic Claude Sonnet 4.5Google Gemini 2.5 Pro

Metrics

API cost (USD per session)Time to first token (TTFT)

Datasets

DeepResearch Bench

Benchmarks

DeepResearch Bench

Context Entities

Metrics

cached input tokenscache write tokenscache read tokens

Datasets

DeepResearch Bench (100 research tasks)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Prompt caching reliably reduces API cost.

Latency gains vary by provider and strategy.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Metrics

Datasets

You May Also Want to Read

Survey of how LLMs become autonomous agents, the core architecture, and the research gaps to make them safe and practical.

Key finding

Agentic ROI: prioritize real user value, not raw model scores

Key finding

Hierarchical multi-agent research agent that compresses long context, routes subtasks to specialized tools, and self-corrects failures.

Key finding

Declarative agent spec plus a runtime that enforces safety, memory, and low-latency execution

Key finding

Jointly erase private facts from an LLM agent's weights and persistent memory to stop recontamination

Key finding