AI agents boost capabilities but multiply inference cost, latency variance, and datacenter power needs.

June 4, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

Clear measurements on common agents and benchmarks show large infrastructure costs and concrete optimization wins (prefix caching, batching). Results are grounded but limited to selected agents, models (8B/70B), and GPU-based serving.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 30%

Novelty: 60%

Authors

Jiin Kim, Byeongjun Shin, Jinha Chung, Minsoo Rhu

Why It Matters For Business

AI agents can raise per-query compute and energy by tens to hundreds of times, driving much higher cloud costs and datacenter power needs. Without cost-aware designs, agent features can become economically and environmentally unsustainable.

Summary TLDR

This paper measures the system-level cost of LLM-based AI agents that perform multi-step, tool-augmented reasoning. Across common agents and benchmarks, agents make many more LLM and tool calls per user request, inflate GPU memory needs, and raise per-query energy by 62×–137× versus single-turn LLMs. Prefix caching, request batching, and parallel reasoning reduce some overheads, but test-time scaling shows sharply diminishing accuracy returns and can push datacenter power toward gigawatt scales under heavy traffic. The paper calls for compute-aware agent design (smaller models, routing, caching, and adaptive budgets) to balance accuracy with deployability.
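The link between per-query energy and datacenter power can be sketched with simple arithmetic: energy per query (Wh) times 3600 gives joules, and joules per query times queries per second gives watts. The sketch below applies the 62× and 137× multipliers from the summary to a baseline. The baseline energy and traffic level are hypothetical placeholders chosen only for illustration, not values from the paper.

```python
def fleet_gpu_power_watts(wh_per_query: float, queries_per_second: float) -> float:
    """Average GPU power (W) needed to sustain a given query rate.

    1 Wh = 3600 J, so (Wh/query * 3600) * (queries/s) = J/s = W.
    """
    return wh_per_query * 3600.0 * queries_per_second


# Hypothetical inputs, NOT taken from the paper.
BASELINE_WH_PER_QUERY = 0.3  # assumed single-turn energy per query
QPS = 10_000                 # assumed sustained traffic

baseline_w = fleet_gpu_power_watts(BASELINE_WH_PER_QUERY, QPS)
print(f"baseline: {baseline_w / 1e6:.1f} MW")

for multiplier in (62, 137):  # per-query energy range reported in the summary
    agent_w = fleet_gpu_power_watts(BASELINE_WH_PER_QUERY * multiplier, QPS)
    print(f"{multiplier}x energy per query: {agent_w / 1e6:.1f} MW")
```

Under these assumed inputs the upper multiplier already lands above one gigawatt of GPU power, which is the kind of scale the summary refers to. Different baselines or traffic levels shift the result proportionally.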

Problem Statement

Agentic LLMs replace single-pass inference with iterative planning, tool calls, and reflection. This improves capabilities but creates unpredictable multi-call workloads that raise latency variance, GPU idle time, KV-cache memory pressure, per-query energy, and datacenter power demand. The community lacks a system-level quantification of these costs and practical guidance to make agents deployable at scale.

Main Contribution

First system-level, quantitative characterization of representative AI agent workflows (CoT, ReAct, Reflexion, LATS, LLMCompiler) across multiple benchmarks.

Measured how agent workflows change LLM/tool-call counts, latency breakdown, GPU utilization, and KV-cache memory pressure.

Key Findings

Agentic systems issue many more LLM calls per request than single-turn models.

Numbers: Agents average 9.2× more LLM calls; LATS averages 71 LLM calls/request.

Practical Use: Expect per-request compute to grow by roughly an order of magnitude or more; tune iteration budgets and prefer parallel strategies when latency is critical.

Evidence Ref: Fig. 4 and Sec. IV-A
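One way to picture an iteration budget is a loop that stops either when the agent produces an answer or when a fixed number of LLM calls has been spent. The following is a minimal sketch, not the paper's implementation: `run_with_budget`, `AgentResult`, and `toy_step` are hypothetical names, and `step` stands in for one LLM call plus any tool use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    answer: Optional[str]
    llm_calls: int
    budget_exhausted: bool


def run_with_budget(step: Callable[[int], Optional[str]], max_llm_calls: int) -> AgentResult:
    """Call `step` until it returns an answer or the call budget is spent.

    `step` receives the zero-based call index and returns the final
    answer, or None when the agent wants another iteration.
    """
    for call_index in range(max_llm_calls):
        answer = step(call_index)
        if answer is not None:
            return AgentResult(answer, call_index + 1, False)
    return AgentResult(None, max_llm_calls, True)


# Toy stand-in for an agent that finishes on its fourth call.
def toy_step(call_index: int) -> Optional[str]:
    return "done" if call_index == 3 else None


print(run_with_budget(toy_step, max_llm_calls=10))  # finishes within budget
print(run_with_budget(toy_step, max_llm_calls=2))   # budget runs out first
```

Logging `llm_calls` and `budget_exhausted` per request makes it possible to see how often the cap is hit before tightening it.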

LLM inference and tool execution split overall latency roughly two-to-one.

Numbers: LLM inference ≈69.4% of latency; tools ≈30.2%.

Practical Use: Optimize both LLM serving and tool speed; long external API calls can dominate end-to-end latency and create GPU idle time.

Evidence Ref: Fig. 5 and Sec. IV-A
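The split suggests why both sides need attention: speeding up only one component gives a bounded end-to-end gain. The sketch below applies an Amdahl-style estimate to the reported fractions. It assumes the LLM and tool phases run strictly one after another with no overlap, which is a simplification, and the 2× speedups are hypothetical.

```python
LLM_FRACTION = 0.694   # share of end-to-end latency, from the finding above
TOOL_FRACTION = 0.302  # share of end-to-end latency, from the finding above


def end_to_end_speedup(llm_speedup: float = 1.0, tool_speedup: float = 1.0) -> float:
    """Amdahl-style end-to-end speedup when each phase is accelerated.

    Assumes phases are sequential and non-overlapping; whatever latency
    is not attributed to LLM or tools is left unchanged.
    """
    other = 1.0 - LLM_FRACTION - TOOL_FRACTION
    new_time = LLM_FRACTION / llm_speedup + TOOL_FRACTION / tool_speedup + other
    return 1.0 / new_time


# Hypothetical scenarios: a 2x improvement on one side at a time, then both.
print(f"2x faster tools only: {end_to_end_speedup(tool_speedup=2.0):.2f}x")
print(f"2x faster LLM only:   {end_to_end_speedup(llm_speedup=2.0):.2f}x")
print(f"2x faster both:       {end_to_end_speedup(2.0, 2.0):.2f}x")
```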

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average LLM calls per request (agents vs CoT) | 9.2× more LLM calls vs CoT; LATS ≈71 calls/request | CoT single-call | 9.2× | aggregated across benchmarks | Fig. 4 and Sec. IV-A | Fig. 4 |
| Latency split (LLM vs tool) | LLM 69.4% / tool 30.2% of total latency | total end-to-end | | agentic workloads | Fig. 5 and Sec. IV-A | Fig. 5 |

What To Try In 7 Days

Enable prefix caching and measure KV-cache memory to reduce prefill cost.

Set conservative iteration budgets and log tail-latency outliers to control costs.

Prototype mixed-size model routing: route planning to smaller models and critical reasoning to larger models.
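For the routing prototype, a first pass can be as small as a lookup from step type to model. The sketch below is one possible shape, assuming each agent step is already labeled with a kind. The labels `"planning"` and `"reasoning"` and the function `pick_model` are hypothetical; the model names are the two evaluated in the paper.

```python
from collections import Counter

SMALL_MODEL = "Llama-3.1-8B-Instruct"
LARGE_MODEL = "Llama-3.1-70B-Instruct"

# Hypothetical step labels; a real agent framework would define its own.
ROUTES = {
    "planning": SMALL_MODEL,
    "reasoning": LARGE_MODEL,
}


def pick_model(step_kind: str) -> str:
    """Return the model for a step, falling back to the large model
    when the step kind is unknown (the conservative choice for accuracy)."""
    return ROUTES.get(step_kind, LARGE_MODEL)


# Toy trace of one request: count how many calls each model would serve.
trace = ["planning", "reasoning", "planning", "planning", "reasoning"]
print(Counter(pick_model(kind) for kind in trace))
```

Counting calls per model over real traces gives a quick estimate of how much traffic the smaller model could absorb before measuring any accuracy impact.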

Agent Features

Memory
short-term interaction history; long-term reflections (Reflexion)
Planning
explicit structured planning (DAG); tree search (LATS)
Tool Use
external APIs (Wikipedia, Wolfram Alpha); web interaction tools (WebShop); code execution (HumanEval Python runner)
Frameworks
ReAct; Reflexion; LATS; LLMCompiler; CoT
Is Agentic

Yes

Architectures
multi-step LLM pipelines; DAG planning (LLMCompiler)
Collaboration
not the focus (single-agent workflows analyzed)

Optimization Features

Token Efficiency
few-shot prompt tuning to reduce steps; prompt length trade-offs discussed
Infra Optimization
mixed-model routing (small+large); KV cache compression and pruning (discussed)
Model Optimization
quantization (discussed); distillation (discussed); sparse architectures (discussed)
System Optimization
concurrent request scheduling; adaptive scaling and carbon-aware execution
Training Optimization
not central to this paper
Inference Optimization
prefix caching; token-level batching (vLLM); parallel LLM calls (LATS); prefill-decode disaggregation (discussed); speculative decoding (discussed)
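Prefix caching helps agents in particular because each iteration tends to resend the same instructions and history with a little new text appended. The toy model below counts how many tokens need prefill with and without reuse of a shared prefix. It is a simplified illustration, not how any specific serving engine implements caching, and the trace (a 100-token shared prefix growing by 20 tokens per iteration) is hypothetical.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefill_tokens(prompts: List[List[int]], use_prefix_cache: bool) -> int:
    """Total tokens that must be prefilled across a sequence of prompts.

    Toy model: the cache remembers every earlier prompt, and a new prompt
    skips prefill for the longest prefix it shares with any of them.
    """
    total = 0
    seen: List[List[int]] = []
    for prompt in prompts:
        reused = 0
        if use_prefix_cache:
            reused = max((common_prefix_len(prompt, old) for old in seen), default=0)
        total += len(prompt) - reused
        seen.append(prompt)
    return total


# Hypothetical agent trace: shared 100-token prefix, 20 new tokens per iteration.
shared = list(range(100))
prompts = [shared + list(range(1000, 1000 + 20 * i)) for i in range(1, 6)]

print("prefill tokens without cache:", prefill_tokens(prompts, use_prefix_cache=False))
print("prefill tokens with cache:   ", prefill_tokens(prompts, use_prefix_cache=True))
```

The saving comes at the price of keeping earlier KV entries resident, which is why the digest pairs prefix caching with measuring KV-cache memory.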

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HotpotQA; WebShop; MATH; HumanEval; ShareGPT

Risks & Boundaries

Limitations

Evaluations use two model sizes (8B, 70B); extrapolation to much larger models is uncertain.

Energy and power estimates include GPU energy only; CPU, network, and cooling overheads are omitted.

When Not To Use

When strictly low per-query cost or a strict latency SLA is required and infrastructure cannot be upgraded.

When external tools are extremely high-latency and cannot be internalized or cached.

Failure Modes

Long-tail outlier requests consume full iteration budget and inflate average cost.

GPU underutilization during tool waits causes inefficient amortization of hardware.
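The first failure mode is easy to miss when only averages are tracked. The sketch below summarizes LLM calls per request with a mean and nearest-rank percentiles, using a hypothetical workload in which 5% of requests run to a 50-call budget. The numbers are illustrative, not measurements from the paper.

```python
import math
import statistics
from typing import Dict, List


def summarize_calls(calls_per_request: List[int]) -> Dict[str, float]:
    """Mean plus nearest-rank p50 and p99 of LLM calls per request."""
    ordered = sorted(calls_per_request)

    def percentile(p: float) -> int:
        # Nearest-rank method: smallest value with at least p% of data at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
    }


# Hypothetical workload: 95 typical requests and 5 that exhaust a 50-call budget.
calls = [5] * 95 + [50] * 5
summary = summarize_calls(calls)
print(summary)
```

In this toy workload the median request is unaffected while the mean rises well above it, so cost dashboards that show only averages would hide where the extra spend comes from.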

Core Entities

Models

Llama-3.1-8B-Instruct; Llama-3.1-70B-Instruct

Metrics

end-to-end latency; throughput (QPS); GPU energy (Wh/query); accuracy

Datasets

HotpotQA; WebShop; MATH; HumanEval; ShareGPT

Benchmarks

HotpotQA; WebShop; MATH; HumanEval; ShareGPT