AI agents boost capabilities but multiply inference cost, latency variance, and datacenter power needs.

June 4, 2025 · 8 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.9

Citation Count

1

Authors

Jiin Kim, Byeongjun Shin, Jinha Chung, Minsoo Rhu

Why It Matters For Business

AI agents can raise per-query compute and energy by tens to hundreds of times, driving much higher cloud costs and datacenter power needs. Without cost-aware designs, agent features can become economically and environmentally unsustainable.

Summary TLDR

This paper measures the system-level cost of LLM-based AI agents that perform multi-step, tool-augmented reasoning. Across common agents and benchmarks, agents make many more LLM and tool calls per user request, inflate GPU memory needs, and raise per-query energy by 62×–137× versus single-turn LLMs. Prefix caching, request batching, and parallel reasoning reduce some overheads, but test-time scaling shows sharply diminishing accuracy returns and can push datacenter power toward gigawatt scales under heavy traffic. The paper calls for compute-aware agent design (smaller models, routing, caching, and adaptive budgets) to balance accuracy with deployability.

Problem Statement

Agentic LLMs replace single-pass inference with iterative planning, tool calls, and reflection. This improves capabilities but creates unpredictable multi-call workloads that raise latency variance, GPU idle time, KV-cache memory pressure, per-query energy, and datacenter power demand. The community lacks a system-level quantification of these costs and practical guidance to make agents deployable at scale.

Main Contribution

First system-level, quantitative characterization of representative AI agent workflows (CoT, ReAct, Reflexion, LATS, LLMCompiler) across multiple benchmarks.

Measured how agent workflows change LLM/tool-call counts, latency breakdown, GPU utilization, and KV-cache memory pressure.

Quantified test-time scaling trade-offs (sequential vs parallel), prefix-caching benefits, throughput effects, and per-query GPU energy leading to datacenter power estimates.

Key Findings

Agentic systems issue many more LLM calls per request than single-turn models.

Numbers: Agents average 9.2× more LLM calls; LATS averages 71 LLM calls/request.

LLM inference and tool execution split overall latency roughly two-to-one.

Numbers: LLM inference ≈69.4% of latency; tools ≈30.2%.

Prefix caching sharply reduces redundant prefill cost and KV memory for agents with long histories.

Numbers: Prefill latency −60.1%; end-to-end agent latency −15.7% on average; LATS KV memory −64.8%; serving KV memory −51.7% on average and −63.5% at maximum.
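To see why caching helps agents in particular, here is a toy sketch of prefill work for a request whose prompt grows at every step. The token counts are hypothetical, and the model ignores block-granular caching and cache eviction, so it illustrates the mechanism rather than reproducing the paper's measurements.

```python
# Toy model of prefill work for one agent request whose prompt grows each step.
# Without prefix caching, every LLM call re-processes the whole history.
# With prefix caching, only tokens appended since the previous call are prefilled.
# All token counts are HYPOTHETICAL and chosen only for illustration.


def prefill_tokens(new_tokens_per_call: list[int], prefix_caching: bool) -> int:
    """Total tokens prefilled across all LLM calls of one agent request."""
    total = 0
    history = 0  # tokens already processed in earlier calls
    for new in new_tokens_per_call:
        prompt_len = history + new
        total += new if prefix_caching else prompt_len
        history = prompt_len
    return total


if __name__ == "__main__":
    # First call: system prompt + question; later calls: prior output + tool observation.
    calls = [1200, 300, 300, 300, 300]
    without = prefill_tokens(calls, prefix_caching=False)
    cached = prefill_tokens(calls, prefix_caching=True)
    print(f"prefill tokens without caching: {without}")
    print(f"prefill tokens with caching:    {cached}")
    print(f"reduction: {1 - cached / without:.1%}")
```

The saving grows with the number of steps, which is consistent with the larger benefit reported for agents with long histories.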

Concurrent request scheduling converts agentic GPU idle time into throughput gains.

Numbers: ReAct throughput, sequential → concurrent: HotpotQA 0.10 → 2.6 QPS (25×); WebShop 0.19 → 1.2 QPS (6.2×).
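The effect can be illustrated with a small asyncio simulation, in which a lock stands in for the GPU and sleeps stand in for LLM and tool time. All durations and step counts are hypothetical. The model also serializes LLM calls, whereas a real serving engine batches them, so this is only a sketch of how overlapping tool waits recovers idle GPU time.

```python
import asyncio
import time

LLM_SECONDS = 0.02   # hypothetical GPU time per LLM call
TOOL_SECONDS = 0.05  # hypothetical wait per tool call (GPU is idle)
STEPS = 3            # hypothetical LLM+tool iterations per request


async def agent_request(gpu: asyncio.Lock) -> None:
    """One agent request: alternate GPU-bound LLM calls and off-GPU tool waits."""
    for _ in range(STEPS):
        async with gpu:                    # GPU busy: one LLM call at a time
            await asyncio.sleep(LLM_SECONDS)
        await asyncio.sleep(TOOL_SECONDS)  # GPU free while the tool runs


async def run(num_requests: int, concurrent: bool) -> float:
    """Return achieved throughput in requests per second."""
    gpu = asyncio.Lock()
    start = time.perf_counter()
    if concurrent:
        await asyncio.gather(*(agent_request(gpu) for _ in range(num_requests)))
    else:
        for _ in range(num_requests):
            await agent_request(gpu)
    return num_requests / (time.perf_counter() - start)


def compare(num_requests: int = 4) -> tuple[float, float]:
    """Throughput (sequential, concurrent) for the same simulated workload."""
    sequential = asyncio.run(run(num_requests, concurrent=False))
    overlapped = asyncio.run(run(num_requests, concurrent=True))
    return sequential, overlapped


if __name__ == "__main__":
    seq, con = compare()
    print(f"sequential: {seq:.1f} req/s, concurrent: {con:.1f} req/s")
```

In the sequential run the simulated GPU sits idle during every tool wait. In the concurrent run another request's LLM call fills that gap.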

Per-query GPU energy for agents is orders of magnitude higher than single-turn LLM inference.

Numbers: ShareGPT 0.32 Wh (8B) / 2.55 Wh (70B) vs Reflexion 41.53 Wh (8B) / 348.41 Wh (70B); 62.1×–136.5× increase.
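As a quick check on the multipliers, the ratios can be recomputed from the per-query figures quoted in this summary, including the LATS values listed under Results. The inputs are rounded, so the results may differ from the reported 62.1×–136.5× range in the last digit.

```python
# Per-query GPU energy in Wh, as quoted in this summary (rounded values).
ENERGY_WH = {
    "ShareGPT":  {"8B": 0.32,  "70B": 2.55},
    "Reflexion": {"8B": 41.53, "70B": 348.41},
    "LATS":      {"8B": 22.76, "70B": 158.48},
}


def ratio_vs_baseline(workload: str, size: str, baseline: str = "ShareGPT") -> float:
    """Energy of an agent workload relative to the single-turn baseline."""
    return ENERGY_WH[workload][size] / ENERGY_WH[baseline][size]


if __name__ == "__main__":
    for workload in ("Reflexion", "LATS"):
        for size in ("8B", "70B"):
            print(f"{workload} {size}: {ratio_vs_baseline(workload, size):.1f}x")
```

From these rounded inputs, the low end of the range appears to correspond to LATS on the 70B model and the high end to Reflexion on the 70B model.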

Test-time scaling shows diminishing returns: small accuracy gains at large extra cost.

Numbers: Reflexion: latency 16.9 s → 25.6 s gave +4% accuracy; later gains require ≈31× more cost for similar marginal improvement.

Results

Average LLM calls per request (agents vs CoT)

Value: 9.2× more LLM calls vs CoT; LATS ≈71 calls/request

Baseline: CoT single-call

Latency split (LLM vs tool)

Value: LLM 69.4% / tool 30.2% of total latency

Baseline: total end-to-end

Prefill latency reduction with prefix caching

Value: −60.1% prefill latency; −15.7% end-to-end on average

Baseline: no prefix caching

KV cache memory reduction with prefix caching (serving)

Value: avg −51.7%, max −63.5% KV memory

Baseline: no prefix caching

Throughput (QPS) comparison

Value: ShareGPT 6.4 QPS; ReAct HotpotQA 2.6 QPS; ReAct WebShop 1.2 QPS

Baseline: ShareGPT single-turn

GPU energy per query (Wh)

Value: ShareGPT 0.32 (8B) / 2.55 (70B); Reflexion 41.53 (8B) / 348.41 (70B); LATS 22.76 (8B) / 158.48 (70B)

Baseline: ShareGPT single-turn
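The per-query energies above translate into sustained power once a traffic level is fixed. The following back-of-envelope sketch uses the 70B figures from this summary. The traffic level of 1,000 queries per second is a hypothetical assumption, not a number from the paper, and only GPU energy is counted.

```python
# Back-of-envelope: sustained GPU power implied by per-query energy and traffic.
# Per-query energies come from this summary; the traffic level is HYPOTHETICAL.

WH_TO_JOULES = 3600.0  # 1 Wh = 3600 J


def sustained_power_watts(energy_wh_per_query: float, queries_per_second: float) -> float:
    """Average power (W) = energy per query (J) x query rate (1/s)."""
    return energy_wh_per_query * WH_TO_JOULES * queries_per_second


if __name__ == "__main__":
    assumed_qps = 1_000.0  # hypothetical fleet-wide traffic, not from the paper
    for label, wh in [
        ("ShareGPT single-turn, 70B", 2.55),
        ("Reflexion agent, 70B", 348.41),
    ]:
        gw = sustained_power_watts(wh, assumed_qps) / 1e9
        print(f"{label}: {gw:.3f} GW at {assumed_qps:.0f} QPS")
```

Under this assumed load the single-turn workload stays in the megawatt range while the agent workload exceeds one gigawatt, which is the scale of concern the summary describes.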

Who Should Care

What To Try In 7 Days

Enable prefix caching and measure KV-cache memory to reduce prefill cost.

Set conservative iteration budgets and log tail-latency outliers to control costs.

Prototype mixed-size model routing: route planning to smaller models and critical reasoning to larger models.
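For the iteration-budget suggestion above, a minimal sketch of a capped agent loop that logs outliers might look like the following. The `step` callable, the default budget of 5 iterations, and the 30-second threshold are all hypothetical placeholders to adapt to a given agent framework.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("agent_budget")


def run_with_budget(
    step: Callable[[int], Optional[str]],
    max_iterations: int = 5,         # hypothetical default budget
    slow_threshold_s: float = 30.0,  # hypothetical tail-latency threshold
) -> tuple[Optional[str], int, float]:
    """Run an agent loop until `step` returns an answer or the budget is spent.

    `step(i)` is a hypothetical callable that performs one reason/act
    iteration and returns the final answer, or None to request another one.
    Returns (answer, iterations_used, elapsed_seconds).
    """
    start = time.perf_counter()
    answer: Optional[str] = None
    used = 0
    for i in range(max_iterations):
        used = i + 1
        answer = step(i)
        if answer is not None:
            break
    elapsed = time.perf_counter() - start
    # Log requests that exhausted the budget or exceeded the latency threshold.
    if answer is None or elapsed > slow_threshold_s:
        logger.warning(
            "outlier request: iterations=%d elapsed=%.2fs answered=%s",
            used, elapsed, answer is not None,
        )
    return answer, used, elapsed


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # Toy step function that answers on its third iteration.
    print(run_with_budget(lambda i: "done" if i == 2 else None)[:2])
```

Logging the requests that hit the cap gives a direct view of the long-tail outliers that inflate average cost.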

Agent Features

Memory

  • short-term interaction history
  • long-term reflections (Reflexion)

Planning

  • explicit structured planning (DAG)
  • tree search (LATS)

Tool Use

  • external APIs (Wikipedia, Wolfram Alpha)
  • web interaction tools (WebShop)
  • code execution (HumanEval Python runner)

Frameworks

  • ReAct
  • Reflexion
  • LATS
  • LLMCompiler
  • CoT

Is Agentic

true

Architectures

  • multi-step LLM pipelines
  • DAG planning (LLMCompiler)

Collaboration

  • not the focus (single-agent workflows analyzed)

Optimization Features

Token Efficiency

  • few-shot prompt tuning to reduce steps
  • prompt length trade-offs discussed

Infra Optimization

  • mixed-model routing (small+large)
  • KV cache compression and pruning (discussed)
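Mixed-model routing can be prototyped with a very small policy function. The sketch below uses the two model names listed in this summary. The step kinds and the rule for which kinds go to the large model are hypothetical and would need tuning against accuracy measurements.

```python
from dataclasses import dataclass

SMALL_MODEL = "Llama-3.1-8B-Instruct"
LARGE_MODEL = "Llama-3.1-70B-Instruct"

# HYPOTHETICAL policy: step kinds treated as critical reasoning.
LARGE_MODEL_KINDS = frozenset({"reason", "reflect"})


@dataclass(frozen=True)
class AgentStep:
    kind: str    # hypothetical labels, e.g. "plan", "tool_call", "reason", "reflect"
    prompt: str


def route(step: AgentStep) -> str:
    """Pick a model name for one agent step under the policy above."""
    if step.kind in LARGE_MODEL_KINDS:
        return LARGE_MODEL
    return SMALL_MODEL


if __name__ == "__main__":
    for kind in ("plan", "tool_call", "reason"):
        print(kind, "->", route(AgentStep(kind=kind, prompt="...")))
```

Keeping the policy in one function makes it easy to log routing decisions and compare cost and accuracy across policies.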

Model Optimization

  • quantization (discussed)
  • distillation (discussed)
  • sparse architectures (discussed)

System Optimization

  • concurrent request scheduling
  • adaptive scaling and carbon-aware execution

Training Optimization

  • not central to this paper

Inference Optimization

  • prefix caching
  • token-level batching (vLLM)
  • parallel LLM calls (LATS)
  • prefill-decode disaggregation (discussed)
  • speculative decoding (discussed)

Reproducibility

Data Urls

  • HotpotQA
  • WebShop
  • MATH
  • HumanEval
  • ShareGPT

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations use only two model sizes (8B and 70B), so extrapolation to much larger models is uncertain.
  • Energy and power estimates include GPU energy only; CPU, network, cooling overheads are omitted.
  • Benchmarks cover several tasks but not all real-world agent workloads; external API latency patterns vary by deployment.

When Not To Use

  • If strict low per-query cost or strict SLA latency is required without infrastructure upgrades.
  • When external tools are extremely high-latency and cannot be internalized or cached.
  • For very high-volume, low-value queries where single-turn LLMs suffice.

Failure Modes

  • Long-tail outlier requests consume full iteration budget and inflate average cost.
  • GPU underutilization during tool waits causing inefficient amortization of hardware.
  • KV-cache memory blowup from growing interaction histories, limiting concurrent scale.
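The KV-cache failure mode can be sized with the standard per-token formula: 2 (key and value) × layers × KV heads × head dimension × bytes per value. The layer and head counts below are assumptions taken from the publicly released Llama 3.1 model configurations, not from the paper. The context length and concurrency are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVConfig:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes fp16/bf16 KV cache

    def bytes_per_token(self) -> int:
        # Factor 2: one key vector and one value vector per layer per KV head.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_value


# ASSUMED architecture values from the public model configs (not from the paper).
LLAMA_3_1_8B = KVConfig(layers=32, kv_heads=8, head_dim=128)
LLAMA_3_1_70B = KVConfig(layers=80, kv_heads=8, head_dim=128)


def kv_cache_gib(cfg: KVConfig, context_tokens: int, concurrent_requests: int) -> float:
    """KV-cache footprint in GiB with no sharing between requests."""
    return cfg.bytes_per_token() * context_tokens * concurrent_requests / 2**30


if __name__ == "__main__":
    # HYPOTHETICAL load: 16 concurrent agent requests with 8192-token histories.
    for name, cfg in [("8B", LLAMA_3_1_8B), ("70B", LLAMA_3_1_70B)]:
        print(f"{name}: {cfg.bytes_per_token() // 1024} KiB/token, "
              f"{kv_cache_gib(cfg, 8192, 16):.1f} GiB total")
```

Because the footprint is linear in both history length and concurrency, growing interaction histories directly reduce how many requests fit on a GPU at once.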

Core Entities

Models

  • Llama-3.1-8B-Instruct
  • Llama-3.1-70B-Instruct

Metrics

  • end-to-end latency
  • throughput (QPS)
  • GPU energy (Wh/query)
  • accuracy

Datasets

  • HotpotQA
  • WebShop
  • MATH
  • HumanEval
  • ShareGPT

Benchmarks

  • HotpotQA
  • WebShop
  • MATH
  • HumanEval
  • ShareGPT