Prompt caching cuts agent API costs 41–80% and speeds time-to-first-token 13–31%

January 9, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Cross-provider tests and ablations support consistent cost savings. TTFT effects depend on provider behavior, so test in your own environment before deploying.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 80%

Novelty: 40%

Authors

Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah

Links

Abstract / PDF / Data

Why It Matters For Business

Prompt caching can cut multi-turn agent API bills by roughly half or more and often speed initial response; apply it to reduce cloud costs and improve user-perceived latency.

Summary TLDR

This paper measures provider prompt caching (reusing KV tensors) for long, tool-heavy agent sessions. Across OpenAI, Anthropic, and Google flagship models on DeepResearch Bench, caching reduced API cost by 41–80% and improved time to first token (TTFT) by 13–31% when using targeted cache strategies. Key practical advice: cache only stable prefixes (e.g., the system prompt), place dynamic values after the cacheable prefix, and avoid naive full-context caching, which can increase latency. Benefits scale with prompt size and hold across 3–50 tool calls; latency gains depend on provider behavior and on each provider's minimum caching threshold.

Problem Statement

Agentic LLM sessions grow large and expensive as agents call tools repeatedly. Providers offer prompt caching (reusing attention key-value tensors) but we lack systematic, cross-provider measurements and guidance for long-horizon, tool-heavy agent workloads.

Main Contribution

First cross-provider, empirical evaluation of provider prompt caching on multi-turn agentic sessions.

Quantified cost and latency effects across four flagship models and three providers on DeepResearch Bench.

Key Findings

Prompt caching reliably reduces API cost.

Numbers: Cost reduced 41%–80% vs no-cache (Table 1)

Practical Use: Enable provider prompt caching and ensure a stable cacheable prefix (system prompt) to cut token bills substantially.

Evidence Ref: Table 1, Table 2
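
To make the cost finding concrete, here is a minimal sketch of how a session bill could be computed from a token breakdown. The `Usage` and `Prices` types, the per-million-token prices, and the token counts are all hypothetical placeholders, not figures from the paper or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    uncached_input: int  # input tokens billed at the full input rate
    cache_read: int      # input tokens served from the prompt cache
    cache_write: int     # input tokens written to the prompt cache
    output: int          # generated tokens


@dataclass
class Prices:
    # USD per million tokens. Hypothetical values for illustration only.
    input: float
    cache_read: float
    cache_write: float
    output: float


def session_cost(usage: Usage, prices: Prices) -> float:
    """Return the session cost in USD for the given token breakdown."""
    total = (
        usage.uncached_input * prices.input
        + usage.cache_read * prices.cache_read
        + usage.cache_write * prices.cache_write
        + usage.output * prices.output
    )
    return total / 1_000_000


def percent_reduction(baseline: float, cached: float) -> float:
    """Relative cost reduction of `cached` versus `baseline`, in percent."""
    return 100.0 * (baseline - cached) / baseline


# Assumed prices and token counts, chosen only to show the arithmetic.
prices = Prices(input=2.0, cache_read=0.2, cache_write=2.5, output=10.0)
no_cache = Usage(uncached_input=1_000_000, cache_read=0, cache_write=0, output=20_000)
with_cache = Usage(uncached_input=150_000, cache_read=800_000, cache_write=50_000, output=20_000)

baseline_usd = session_cost(no_cache, prices)
cached_usd = session_cost(with_cache, prices)
print(f"baseline ${baseline_usd:.3f}, cached ${cached_usd:.3f}, "
      f"reduction {percent_reduction(baseline_usd, cached_usd):.1f}%")
```

With these made-up inputs the reduction comes out at roughly 64%. The real figure depends on the share of input tokens that hit the cache and on the provider's actual read and write rates.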

Latency gains vary by provider and strategy.

Numbers: TTFT improved 13%–31% (best cache modes across models)

Practical Use: Measure TTFT per provider; pick the cache mode for latency (system prompt only, or exclude tool results) rather than assuming full-context caching is best.

Evidence Ref: Figure 1, Table 1
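
Measuring TTFT per provider can be sketched as timing the gap between issuing a request and receiving the first streamed chunk. The harness below is an illustration under assumptions: `fake_stream` is a stand-in for a real streaming client (no provider SDK is called), and `measure_ttft` and `median_ttft` are hypothetical helper names.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> float:
    """Seconds from starting the request until the first chunk arrives."""
    start = time.perf_counter()
    for _chunk in start_stream():
        # Stop at the first chunk; the rest of the stream is not needed.
        return time.perf_counter() - start
    raise RuntimeError("stream produced no chunks")


def median_ttft(start_stream: Callable[[], Iterable[str]], runs: int = 5) -> float:
    """Median TTFT over several runs, to damp network jitter."""
    samples = sorted(measure_ttft(start_stream) for _ in range(runs))
    return samples[len(samples) // 2]


def fake_stream(delay_s: float) -> Iterator[str]:
    """Placeholder for a provider streaming response (assumption, not a real client)."""
    time.sleep(delay_s)  # simulated prefill latency before the first token
    yield "first"
    yield "second"


baseline = median_ttft(lambda: fake_stream(0.03))
cached = median_ttft(lambda: fake_stream(0.01))
print(f"baseline {baseline * 1000:.1f} ms, cached {cached * 1000:.1f} ms")
```

In a real comparison, `start_stream` would wrap the same request issued under each cache mode, and the medians would be compared per provider.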

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Cost reduction (best cache mode) | 41%–80% lower cost vs no-cache | No-cache baseline | −41% to −80% | DeepResearch Bench, 500 sessions | Table 1 shows best-mode cost ↓: GPT-5.2 79.6%, Claude 78.5%, Gemini 41.4%, GPT-4o 45.9% | Table 1 |
| Time-to-first-token (TTFT) improvement (best cache mode) | 13%–31% faster | No-cache baseline | −13% to −31% | DeepResearch Bench | Table 1: TTFT ↓: GPT-5.2 13.0%, Claude 22.9%, Gemini 6.1%, GPT-4o 30.9% | Table 1 |

What To Try In 7 Days

Measure current session token breakdown (system prompt vs dynamic content).

Enable provider prompt caching and run representative sessions to measure cost and TTFT.

Move dynamic values (timestamps, IDs, tool results) after a UUID cache breaker to preserve a reusable system-prompt prefix.
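
The last step above can be sketched as a prompt layout in which the stable system prompt comes first, a fresh UUID marks the cache boundary, and every volatile value follows it. This is one reading of the "UUID cache breaker" idea, not the paper's confirmed implementation; `build_prompt`, `shared_prefix_len`, and the boundary marker format are hypothetical.

```python
import uuid
from typing import Mapping

SYSTEM_PROMPT = "You are a research agent. Use the search tool and cite sources."


def build_prompt(system_prompt: str, dynamic_values: Mapping[str, str]) -> str:
    """Stable prefix first, then a unique boundary marker, then volatile values."""
    # A new UUID per request means nothing after this line can match a cached prefix.
    breaker = f"[cache-boundary {uuid.uuid4()}]"
    dynamic_block = "\n".join(f"{key}: {value}" for key, value in sorted(dynamic_values.items()))
    return f"{system_prompt}\n{breaker}\n{dynamic_block}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


p1 = build_prompt(SYSTEM_PROMPT, {"timestamp": "2026-01-09T10:00:00Z", "session_id": "a1"})
p2 = build_prompt(SYSTEM_PROMPT, {"timestamp": "2026-01-09T10:05:00Z", "session_id": "b2"})

# The two requests differ, yet the whole system prompt remains a common prefix.
print(shared_prefix_len(p1, p2) >= len(SYSTEM_PROMPT))
```

The design choice is that the reusable part ends exactly where the system prompt ends, so timestamps and IDs never invalidate it.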

Agent Features

Memory
prompt cache (KV tensor reuse); short-term context window
Planning
multi-turn tool calling; iterative web research
Tool Use
web search tool; function calling APIs
Frameworks
LangChain / Deep Agents; Provider-managed prompt caches (OpenAI/Anthropic/Google)
Is Agentic

Yes

Architectures
LLM agents with function calling

Optimization Features

Token Efficiency
cached input tokens billed at lower rates; system prompt caching yields largest token savings
Infra Optimization
consider provider minimum token thresholds and TTLs; account for cache write/read pricing differences
System Optimization
place dynamic values after cacheable prefix; avoid embedding volatile tool definitions in system prompt
Inference Optimization
prompt caching (reuse KV tensors); cache boundary control with UUIDs; exclude dynamic tool results from cache
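
The interaction between cache write and cache read pricing can be illustrated with a small break-even calculation. The sketch assumes a simple model in which the first request pays a write multiplier on the prefix and each later request pays a read multiplier, with the cache staying warm throughout (no TTL expiry). The multipliers used are hypothetical, not any provider's published rates.

```python
def cached_cost_ratio(requests: int, write_mult: float, read_mult: float) -> float:
    """Prefix cost with caching divided by prefix cost without caching.

    Without caching, each of `requests` calls pays 1.0x for the prefix.
    With caching, the first call pays `write_mult` and the rest pay `read_mult`.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    with_cache = write_mult + (requests - 1) * read_mult
    return with_cache / requests


def break_even_requests(write_mult: float, read_mult: float) -> int:
    """Smallest number of requests at which caching is strictly cheaper."""
    if not 0.0 <= read_mult < 1.0:
        raise ValueError("read_mult must be in [0, 1) for caching to ever pay off")
    requests = 1
    while cached_cost_ratio(requests, write_mult, read_mult) >= 1.0:
        requests += 1
    return requests


# Hypothetical multipliers: writes cost 25% extra, reads cost 10% of the base rate.
print(break_even_requests(write_mult=1.25, read_mult=0.10))
print(round(cached_cost_ratio(10, write_mult=1.25, read_mult=0.10), 3))
```

Under this model a write premium is recovered quickly once a prefix is reused, which is consistent with the advice to cache only content that actually repeats.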

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

DeepResearch Bench (Du et al., 2025)

Risks & Boundaries

Limitations

Experiments run on DeepResearch Bench; results may differ for other workloads.

Provider pricing, TTLs, and caching behavior reflect early-2026 documentation and can change.

When Not To Use

Short prompts below provider minimum tokens (no caching benefit).

Sessions where tool outputs are unique and never repeat across requests.
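
The two conditions above can be turned into a simple pre-flight check. This is a rough sketch: `CachePolicy`, its field values, and `caching_likely_helps` are hypothetical, and real minimum token thresholds and TTLs must come from each provider's current documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    min_cacheable_tokens: int  # provider minimum prefix length (assumed value below)
    ttl_seconds: float         # how long a cached prefix stays warm (assumed value below)


def caching_likely_helps(
    prefix_tokens: int,
    expected_reuses: int,
    gap_seconds: float,
    policy: CachePolicy,
) -> bool:
    """Heuristic gate: is a prefix long enough, reused, and reused soon enough?"""
    if prefix_tokens < policy.min_cacheable_tokens:
        return False  # below the provider minimum, nothing gets cached
    if expected_reuses < 1:
        return False  # content never repeats, so there are no cache reads
    return gap_seconds <= policy.ttl_seconds  # reuse must land before expiry


# Placeholder policy for illustration only.
policy = CachePolicy(min_cacheable_tokens=1024, ttl_seconds=300.0)

print(caching_likely_helps(200, expected_reuses=20, gap_seconds=5.0, policy=policy))
print(caching_likely_helps(8000, expected_reuses=0, gap_seconds=5.0, policy=policy))
print(caching_likely_helps(8000, expected_reuses=20, gap_seconds=5.0, policy=policy))
```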

Failure Modes

Full-context caching writes dynamic content, creating overhead that can increase latency.

Including volatile tokens in the system prompt breaks cache reuse across sessions.
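
One way to guard against the second failure mode is to scan a system prompt for values that change per request before shipping it. The patterns below are a minimal, non-exhaustive illustration; `VOLATILE_PATTERNS` and `find_volatile` are hypothetical names, and a real check would need patterns for whatever dynamic values a given agent injects.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; extend for request IDs, user names, and so on.
VOLATILE_PATTERNS = {
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "iso_timestamp": re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(?::\d{2})?"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def find_volatile(system_prompt: str) -> List[Tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs for likely per-request values."""
    hits = []
    for name, pattern in VOLATILE_PATTERNS.items():
        for match in pattern.finditer(system_prompt):
            hits.append((name, match.group(0)))
    return hits


stable = "You are a research agent. Use the search tool and cite sources."
volatile = "You are a research agent. Today is 2026-01-09."

print(find_volatile(stable))
print(find_volatile(volatile))
```

An empty result does not prove a prompt is cache-safe; it only means none of the listed patterns matched.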

Core Entities

Models

OpenAI GPT-5.2; OpenAI GPT-4o; Anthropic Claude Sonnet 4.5; Google Gemini 2.5 Pro

Metrics

API cost (USD per session); Time to first token (TTFT)

Datasets

DeepResearch Bench

Benchmarks

DeepResearch Bench

Context Entities

Metrics

cached input tokens; cache write tokens; cache read tokens

Datasets

DeepResearch Bench (100 research tasks)