Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.9
Citation Count
1
Why It Matters For Business
AI agents can raise per-query compute and energy by tens to hundreds of times, driving much higher cloud costs and datacenter power needs; without cost-aware designs, agent features can become economically and environmentally unsustainable.
Summary TLDR
This paper measures the system-level cost of LLM-based AI agents that perform multi-step, tool-augmented reasoning. Across common agents and benchmarks, agents make many more LLM and tool calls per user request, inflate GPU memory needs, and raise per-query energy by 62×–137× versus single-turn LLMs. Prefix caching, request batching, and parallel reasoning reduce some overheads, but test-time scaling shows sharply diminishing accuracy returns and can push datacenter power toward gigawatt scales under heavy traffic. The paper calls for compute-aware agent design (smaller models, routing, caching, and adaptive budgets) to balance accuracy with deployability.
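To see how per-query energy turns into datacenter power, a back-of-envelope conversion can help. The sketch below multiplies energy per query by query rate. The baseline energy and the traffic level are hypothetical placeholders, not figures from the paper; only the 62× and 137× multipliers come from the summary above.

```python
def average_power_watts(energy_wh_per_query: float, queries_per_second: float) -> float:
    """Average power in watts: (Wh per query) * (queries per hour) = Wh/h = W."""
    return energy_wh_per_query * queries_per_second * 3600.0


# Hypothetical inputs for illustration only (not taken from the paper).
BASELINE_WH_PER_QUERY = 0.3   # assumed single-turn GPU energy per query
QUERIES_PER_SECOND = 10_000   # assumed sustained traffic

for multiplier in (1, 62, 137):  # 62x and 137x are the range reported above
    watts = average_power_watts(BASELINE_WH_PER_QUERY * multiplier, QUERIES_PER_SECOND)
    print(f"{multiplier:>3}x energy per query -> {watts / 1e6:.1f} MW of GPU power")
```

Like the paper's estimates, this covers GPU energy only, so CPU, network, and cooling overheads would come on top.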
Problem Statement
Agentic LLMs replace single-pass inference with iterative planning, tool calls, and reflection. This improves capabilities but creates unpredictable multi-call workloads that raise latency variance, GPU idle time, KV-cache memory pressure, per-query energy, and datacenter power demand. The community lacks a system-level quantification of these costs and practical guidance to make agents deployable at scale.
Main Contribution
First system-level, quantitative characterization of representative AI agent workflows (CoT, ReAct, Reflexion, LATS, LLMCompiler) across multiple benchmarks.
Measured how agent workflows change LLM/tool-call counts, latency breakdown, GPU utilization, and KV-cache memory pressure.
Quantified test-time scaling trade-offs (sequential vs parallel), prefix-caching benefits, throughput effects, and per-query GPU energy, from which datacenter power estimates are derived.
Key Findings
Agentic systems issue many more LLM calls per request than single-turn models.
LLM inference and tool execution split overall latency roughly two-to-one.
Prefix caching sharply reduces redundant prefill cost and KV memory for agents with long histories.
Concurrent request scheduling converts agentic GPU idle time into throughput gains.
Per-query GPU energy for agents is orders of magnitude higher than single-turn LLM inference.
Test-time scaling shows diminishing returns: small accuracy gains at large extra cost.
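The prefix-caching finding can be illustrated with a simple token count. This is an idealized sketch, assuming each LLM call's prompt is the full prior history plus newly appended tokens and that cached entries are never evicted; the function name and the token counts are hypothetical.

```python
from typing import List


def prefill_tokens(new_tokens_per_call: List[int], prefix_cached: bool) -> int:
    """Total tokens prefilled across all LLM calls of one agent request.

    new_tokens_per_call[i] is the number of tokens appended to the history
    before call i (prompt, tool observations, earlier model outputs).
    """
    total = 0
    history = 0
    for new in new_tokens_per_call:
        history += new
        # With prefix caching, KV entries for the earlier history are reused,
        # so only the newly appended tokens need to be prefilled.
        total += new if prefix_cached else history
    return total


calls = [1000, 200, 200, 200]  # hypothetical token counts per call
print("without prefix caching:", prefill_tokens(calls, prefix_cached=False))
print("with prefix caching:   ", prefill_tokens(calls, prefix_cached=True))
```

Without caching, prefill work grows roughly quadratically with the number of calls; with caching it grows linearly, which is why agents with long histories benefit most.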
Results
Average LLM calls per request (agents vs CoT)
Latency split (LLM vs tool)
Prefill latency reduction with prefix caching
KV cache memory reduction with prefix caching (serving)
Throughput (QPS) comparison
GPU energy per query (Wh)
Who Should Care
What To Try In 7 Days
Enable prefix caching and measure KV-cache memory to reduce prefill cost.
Set conservative iteration budgets and log tail-latency outliers to control costs.
Prototype mixed-size model routing: route planning to smaller models and critical reasoning to larger models.
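The iteration-budget step above could be prototyped along these lines. This is a minimal sketch, not the paper's method: `run_with_budget`, `toy_step`, and the thresholds are hypothetical names and values, and a real agent step would wrap an LLM call plus any tool call.

```python
import logging
import time
from typing import Callable, Optional, Tuple

logger = logging.getLogger("agent_budget")


def run_with_budget(
    step: Callable[[int], Optional[str]],
    max_iterations: int,
    slow_seconds: float,
) -> Tuple[Optional[str], int]:
    """Call step(i) until it returns an answer or the iteration budget is spent.

    Returns (answer or None, iterations used). Requests that exhaust the
    budget or exceed slow_seconds are logged as outliers for later review.
    """
    start = time.monotonic()
    answer: Optional[str] = None
    used = 0
    for i in range(max_iterations):
        used = i + 1
        answer = step(i)
        if answer is not None:
            break
    elapsed = time.monotonic() - start
    if answer is None or elapsed > slow_seconds:
        logger.warning(
            "outlier request: iterations=%d elapsed=%.2fs answered=%s",
            used, elapsed, answer is not None,
        )
    return answer, used


def toy_step(i: int) -> Optional[str]:
    """Stand-in for one agent iteration; answers on the third call."""
    return "done" if i == 2 else None


print(run_with_budget(toy_step, max_iterations=5, slow_seconds=30.0))
```

Logging both budget exhaustion and slow completions keeps the long-tail requests visible, which matters because they dominate average cost.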
Agent Features
Memory
- short-term interaction history
- long-term reflections (Reflexion)
Planning
- explicit structured planning (DAG)
- tree search (LATS)
Tool Use
- external APIs (Wikipedia, Wolfram Alpha)
- web interaction tools (WebShop)
- code execution (HumanEval Python runner)
Frameworks
- ReAct
- Reflexion
- LATS
- LLMCompiler
- CoT
Is Agentic
true
Architectures
- multi-step LLM pipelines
- DAG planning (LLMCompiler)
Collaboration
- not the focus (single-agent workflows analyzed)
Optimization Features
Token Efficiency
- few-shot prompt tuning to reduce steps
- prompt length trade-offs discussed
Infra Optimization
- mixed-model routing (small+large)
- KV cache compression and pruning (discussed)
Model Optimization
- quantization (discussed)
- distillation (discussed)
- sparse architectures (discussed)
System Optimization
- concurrent request scheduling
- adaptive scaling and carbon-aware execution
Training Optimization
- not central to this paper
Inference Optimization
- prefix caching
- token-level batching (vLLM)
- parallel LLM calls (LATS)
- prefill-decode disaggregation (discussed)
- speculative decoding (discussed)
Reproducibility
Data Urls
- HotpotQA
- WebShop
- MATH
- HumanEval
- ShareGPT
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations use two model sizes (8B, 70B); extrapolation to much larger models is uncertain.
- Energy and power estimates include GPU energy only; CPU, network, cooling overheads are omitted.
- Benchmarks cover several tasks but not all real-world agent workloads; external API latency patterns vary by deployment.
When Not To Use
- When a strictly low per-query cost or a strict latency SLA is required and infrastructure upgrades are not an option.
- When external tools are extremely high-latency and cannot be internalized or cached.
- For very high-volume, low-value queries where single-turn LLMs suffice.
Failure Modes
- Long-tail outlier requests consume full iteration budget and inflate average cost.
- GPU underutilization during tool waits causes inefficient amortization of hardware.
- KV-cache memory blowup from growing interaction histories, limiting concurrent scale.
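The KV-cache failure mode can be sized with a rough estimate. The sketch below assumes the publicly documented Llama-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) with 16-bit cache values; the 40 GiB cache budget and the history lengths are hypothetical, and the estimate ignores paging granularity and prefix sharing.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """KV-cache bytes stored per token of context.

    The factor 2 accounts for one key and one value vector per layer per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(
    kv_budget_bytes: int, tokens_per_request: int, per_token_bytes: int
) -> int:
    """Whole requests that fit in the cache if each holds tokens_per_request tokens."""
    return kv_budget_bytes // (tokens_per_request * per_token_bytes)


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head_dim 128, fp16/bf16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 40 * 1024**3  # hypothetical 40 GiB of GPU memory reserved for the KV cache

for history_tokens in (2_000, 8_000, 32_000):  # hypothetical history lengths
    fit = max_concurrent_requests(budget, history_tokens, per_token)
    print(f"{history_tokens:>6} tokens of history -> about {fit} concurrent requests")
```

Because capacity falls in inverse proportion to history length, an agent whose context grows with every step steadily reduces how many requests a GPU can serve at once.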
Core Entities
Models
- Llama-3.1-8B-Instruct
- Llama-3.1-70B-Instruct
Metrics
- end-to-end latency
- throughput (QPS)
- GPU energy (Wh/query)
- Accuracy
Datasets
- HotpotQA
- WebShop
- MATH
- HumanEval
- ShareGPT
Benchmarks
- HotpotQA
- WebShop
- MATH
- HumanEval
- ShareGPT

