AI agents boost capabilities but multiply inference cost, latency variance, and datacenter power needs.

Overview

Decision Snapshot: Needs Validation

Clear measurements on common agents and benchmarks show large infrastructure costs and concrete optimization wins (prefix caching, batching). Results are grounded but limited to selected agents, models (8B/70B), and GPU-based serving.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 30%

Novelty: 60%

Authors

Jiin Kim, Byeongjun Shin, Jinha Chung, Minsoo Rhu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AI agents can raise per-query compute and energy by tens to hundreds of times, driving much higher cloud costs and datacenter power needs. Without cost-aware designs, agent features can become economically and environmentally unsustainable.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, CEO

Summary TLDR

This paper measures the system-level cost of LLM-based AI agents that perform multi-step, tool-augmented reasoning. Across common agents and benchmarks, agents make many more LLM and tool calls per user request, inflate GPU memory needs, and raise per-query energy by 62×–137× versus single-turn LLMs. Prefix caching, request batching, and parallel reasoning reduce some overheads, but test-time scaling shows sharply diminishing accuracy returns and can push datacenter power toward gigawatt scales under heavy traffic. The paper calls for compute-aware agent design (smaller models, routing, caching, and adaptive budgets) to balance accuracy with deployability.
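The step from per-query energy to datacenter power is plain arithmetic: sustained power is energy per query times query rate. The sketch below shows that conversion. Only the 62× and 137× multipliers come from the summary above; the baseline energy per query and the traffic level are hypothetical values chosen to illustrate the calculation, not figures from the paper.

```python
def sustained_power_watts(energy_wh_per_query: float, queries_per_second: float) -> float:
    """Average power in watts: (Wh per query) * (queries per second) * (3600 s per hour)."""
    return energy_wh_per_query * queries_per_second * 3600.0


# Hypothetical inputs, used only to illustrate the arithmetic.
BASELINE_WH_PER_QUERY = 0.05  # assumed GPU energy of one single-turn query
QUERIES_PER_SECOND = 10_000   # assumed sustained traffic

baseline_w = sustained_power_watts(BASELINE_WH_PER_QUERY, QUERIES_PER_SECOND)

# Energy multipliers reported in the summary for agents vs single-turn LLMs.
for multiplier in (62, 137):
    agent_w = sustained_power_watts(
        BASELINE_WH_PER_QUERY * multiplier, QUERIES_PER_SECOND
    )
    print(
        f"{multiplier}x energy per query: {agent_w / 1e6:.1f} MW "
        f"(baseline {baseline_w / 1e6:.1f} MW)"
    )
```

Because power scales linearly in both factors, the same multiplier that applies to per-query energy applies to fleet power at fixed traffic.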

Problem Statement

Agentic LLMs replace single-pass inference with iterative planning, tool calls, and reflection. This improves capabilities but creates unpredictable multi-call workloads that raise latency variance, GPU idle time, KV-cache memory pressure, per-query energy, and datacenter power demand. The community lacks a system-level quantification of these costs and practical guidance to make agents deployable at scale.

Main Contribution

First system-level, quantitative characterization of representative AI agent workflows (CoT, ReAct, Reflexion, LATS, LLMCompiler) across multiple benchmarks.

Measured how agent workflows change LLM/tool-call counts, latency breakdown, GPU utilization, and KV-cache memory pressure.

Key Findings

Agentic systems issue many more LLM calls per request than single-turn models.

Numbers: Agents average 9.2× more LLM calls; LATS averages 71 LLM calls/request.

Practical Use: Expect order-of-magnitude increases in per-request compute; tune iteration budgets and prefer parallel strategies when latency is critical.

Evidence Ref: Fig. 4 and Sec. IV-A

LLM inference and tool execution split overall latency roughly two-to-one.

Numbers: LLM inference ≈69.4% of latency; tools ≈30.2%.

Practical Use: Optimize both LLM serving and tool speed; long external API calls can dominate end-to-end latency and create GPU idle time.

Evidence Ref: Fig. 5 and Sec. IV-A
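The latency split in the second finding sets a ceiling on what optimizing either side alone can achieve. A minimal sketch using Amdahl's law, assuming LLM inference and tool execution run sequentially without overlap (the summary does not confirm this) and that the reported fractions hold for the workload being tuned:

```python
def end_to_end_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: overall speedup when a component taking `fraction`
    of total latency is made `component_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# Latency fractions reported in the summary above.
LLM_FRACTION = 0.694
TOOL_FRACTION = 0.302

for name, fraction in (("LLM", LLM_FRACTION), ("tools", TOOL_FRACTION)):
    for speedup in (2.0, float("inf")):
        overall = end_to_end_speedup(fraction, speedup)
        print(f"{name} made {speedup}x faster -> {overall:.2f}x end to end")
```

Under these assumptions, even instantaneous tools would cap the end-to-end gain below 1.5×, which is why the finding recommends working on both sides.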

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average LLM calls per request (agents vs CoT)	9.2× more LLM calls vs CoT; LATS ≈71 calls/request	CoT single-call	9.2×	aggregated across benchmarks	Fig.4 and Sec. IV-A	Fig.4
Latency split (LLM vs tool)	LLM 69.4% / tool 30.2% of total latency	total end-to-end	—	agentic workloads	Fig.5 and Sec. IV-A	Fig.5

What To Try In 7 Days

Enable prefix caching and measure KV-cache memory to reduce prefill cost.

Set conservative iteration budgets and log tail-latency outliers to control costs.

Prototype mixed-size model routing: route planning to smaller models and critical reasoning to larger models.
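The routing idea in the last item can be prototyped with a small dispatch function. This is a sketch, not the paper's method: the two model names are the ones evaluated in the paper, while the step labels and the `AgentStep` type are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

SMALL_MODEL = "Llama-3.1-8B-Instruct"
LARGE_MODEL = "Llama-3.1-70B-Instruct"

# Hypothetical step labels; a real agent framework would define its own.
SMALL_MODEL_STEPS = {"plan", "tool_select", "summarize"}


@dataclass(frozen=True)
class AgentStep:
    kind: str    # what the step does, e.g. "plan" or "reason"
    prompt: str  # text that would be sent to the chosen model


def route(step: AgentStep) -> str:
    """Pick a model name for a step. Unknown kinds fall back to the
    large model so that accuracy-critical work is never downgraded."""
    return SMALL_MODEL if step.kind in SMALL_MODEL_STEPS else LARGE_MODEL


steps = [
    AgentStep("plan", "Break the question into sub-tasks."),
    AgentStep("reason", "Combine the evidence and answer."),
]
for step in steps:
    print(f"{step.kind} -> {route(step)}")
```

Defaulting unknown step kinds to the larger model is a conservative choice; a cost-first deployment might invert it and measure the accuracy impact.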

Agent Features

Memory

short-term interaction history; long-term reflections (Reflexion)

Planning

explicit structured planning (DAG); tree search (LATS)

Tool Use

external APIs (Wikipedia, Wolfram Alpha); web interaction tools (WebShop); code execution (HumanEval Python runner)

Frameworks

ReAct, Reflexion, LATS, LLMCompiler, CoT

Is Agentic

Yes

Architectures

multi-step LLM pipelines; DAG planning (LLMCompiler)

Collaboration

not the focus (single-agent workflows analyzed)

Optimization Features

Token Efficiency

few-shot prompt tuning to reduce steps; prompt length trade-offs discussed

Infra Optimization

mixed-model routing (small+large); KV cache compression and pruning (discussed)

Model Optimization

quantization (discussed); distillation (discussed); sparse architectures (discussed)

System Optimization

concurrent request scheduling; adaptive scaling and carbon-aware execution

Training Optimization

not central to this paper

Inference Optimization

prefix caching; token-level batching (vLLM); parallel LLM calls (LATS); prefill-decode disaggregation (discussed); speculative decoding (discussed)
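Prefix caching helps agents because each LLM call in a multi-step workflow usually repeats the previous prompt and appends new observations. The toy model below counts how many tokens need prefill with and without reuse of a previously seen prefix. It is an illustration under simplifying assumptions: token IDs are made up, matching is per token, and nothing is ever evicted, whereas real serving systems typically cache in blocks and evict under memory pressure.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefill_tokens(prompts: List[List[int]], use_prefix_cache: bool) -> int:
    """Total tokens that must be prefilled across a series of LLM calls."""
    seen: List[List[int]] = []
    total = 0
    for prompt in prompts:
        reused = 0
        if use_prefix_cache:
            # Reuse the longest prefix shared with any earlier prompt.
            reused = max((common_prefix_len(prompt, p) for p in seen), default=0)
        total += len(prompt) - reused
        seen.append(prompt)
    return total


# Hypothetical three-step agent trace: each call extends the previous prompt.
system = list(range(100))           # shared instructions and few-shot examples
question = list(range(1000, 1020))  # user request
obs1 = list(range(2000, 2030))      # first tool observation
obs2 = list(range(3000, 3030))      # second tool observation
prompts = [system + question, system + question + obs1, system + question + obs1 + obs2]

print("without cache:", prefill_tokens(prompts, use_prefix_cache=False))
print("with cache:   ", prefill_tokens(prompts, use_prefix_cache=True))
```

In this trace only the newly appended observations need prefill after the first call, which is the effect the summary credits with reducing prefill cost.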

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/VIA-Research/AgentBench

Data URLs

HotpotQA, WebShop, MATH, HumanEval, ShareGPT

Risks & Boundaries

Limitations

Evaluations use two model sizes (8B, 70B); extrapolation to much larger models is uncertain.

Energy and power estimates include GPU energy only; CPU, network, and cooling overheads are omitted.

When Not To Use

If strict low per-query cost or strict SLA latency is required without infrastructure upgrades.

When external tools are extremely high-latency and cannot be internalized or cached.

Failure Modes

Long-tail outlier requests consume full iteration budget and inflate average cost.

GPU underutilization during tool waits causes inefficient amortization of hardware.
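The first failure mode above is easy to see with a small calculation: a handful of requests that exhaust the iteration budget can dominate the mean even when the median is low. The workload below is hypothetical and chosen only to make the effect visible. Capping iterations also means the capped requests may end without an answer, which this sketch does not model.

```python
from statistics import mean, median
from typing import List


def apply_budget(iterations_needed: List[int], budget: int) -> List[int]:
    """Iterations actually executed when each request stops at `budget`."""
    return [min(n, budget) for n in iterations_needed]


# Hypothetical workload: 18 requests finish in 3 iterations,
# 2 outliers would keep going for 50.
needed = [3] * 18 + [50] * 2

for budget in (50, 10):
    used = apply_budget(needed, budget)
    print(
        f"budget={budget}: mean={mean(used):.1f}, "
        f"median={median(used):.1f}, max={max(used)}"
    )
```

Logging the per-request iteration count alongside latency makes it possible to pick a budget from the observed distribution rather than guessing.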

Core Entities

Models

Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct

Metrics

end-to-end latency, throughput (QPS), GPU energy (Wh/query), accuracy

Datasets

HotpotQA, WebShop, MATH, HumanEval, ShareGPT

Benchmarks

HotpotQA, WebShop, MATH, HumanEval, ShareGPT

You May Also Want to Read

Survey: Reframe LLMs as agents that plan, act, and continually learn

Reference architecture, multi-agent taxonomy, and enterprise hardening for LLM agents

Systematizes reusable 'agentic skills' for LLM agents, their lifecycle, design patterns, risks, and evaluation

A closed-loop Sensing→Regulating→Correcting system that routes LLM execution by uncertainty to cut errors and API cost

Diffusion-backed agents match accuracy but run ~30% faster and can reach up to 8× speedups in some cases
