Agentic ROI: prioritize real user value, not raw model scores

May 23, 2025 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

0

Authors

Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, Weinan Zhang

Links

Abstract / PDF

Why It Matters For Business

Measure agent value as Agentic ROI (information gain and time saved relative to cost) to decide where agents can be deployed profitably and to avoid wasting resources on low-ROI, high-cost integrations.

Summary TLDR

This position paper argues that the real bottleneck for widespread LLM agent adoption is low Agentic ROI—the user-facing ratio of information gain and time savings to cost. The authors define Agentic ROI, demonstrate its use with a 34-person survey across five domains, and show high ROI in coding/research but low ROI in mass-market tasks like office work and e-commerce. They propose a zigzag roadmap: first "scale up" agents (sleep-time compute, multi-step reasoning, proactive interaction) to increase information gain and time savings, then "scale down" (memory retrieval, distillation, quantization, hardware-software co-optimization) to cut per-task cost. The paper is a strategic call to re-e

Problem Statement

LLM agents can technically automate many tasks, but many real-world uses deliver too little net benefit to users once time, prompting effort, verification, and cost are accounted for. The paper introduces Agentic ROI to measure whether deploying an agent actually improves users' utility compared to human or UI alternatives.
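The summary describes Agentic ROI only in words, as information gain and time savings relative to cost. One way to turn that description into a number is sketched below. The multiplicative form, the quality threshold, and every function and parameter name are illustrative assumptions, not the paper's stated definition.

```python
def agentic_roi(quality, quality_threshold, human_time, agent_time,
                interaction_time, expense):
    """Illustrative Agentic ROI (assumed form, not taken from the paper):

        (quality gain * time saved) / (interaction time * expense)
    """
    if interaction_time <= 0 or expense <= 0:
        raise ValueError("interaction_time and expense must be positive")

    # Quality below the acceptable threshold counts as zero information gain.
    info_gain = max(quality - quality_threshold, 0.0)

    # Negative when the agent is slower than doing the task by hand.
    time_saved = human_time - agent_time

    return (info_gain * time_saved) / (interaction_time * expense)
```

Under these assumptions, a task that takes 60 minutes by hand and 10 minutes with an agent scores well only if the output clears the quality threshold and the prompting time and expense stay small. If the agent is slower than the human baseline, the score turns negative.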

Main Contribution

Introduce Agentic ROI: a simple, actionable metric combining information gain, time savings, and monetary cost to evaluate agent usability.

Present a small empirical demonstration (n=34 survey) showing Agentic ROI correlates strongly with reported usability (r=0.95).

Describe a practical zigzag roadmap: scale up to raise information gain and time savings, then scale down to cut cost for mass-market adoption.

Highlight concrete engineering levers: sleep-time compute, multi-step capabilities, proactive interaction, memory retrieval, and model compression.

Key Findings

Reported agent usability across domains aligns tightly with computed Agentic ROI.

Numbers: r = 0.95 correlation (survey analysis, Fig. 1b)

High Agentic ROI appears in coding and scientific research; low ROI in office work, e-commerce, and personal assistance.

Numbers: Survey of 34 participants across five domains (coding, research, office, e-commerce, personal)

Prompting overhead and verification time can erase time savings for short, well-structured tasks.

Numbers: Qualitative user reports in survey (prompting and verification offset T0)

Agentic ROI is personalizable: users with lower baseline skill often gain disproportionately large ROI.

Numbers: Argument and illustrative examples in text (no large-n validation)

Results

Survey sample size

Value: 34 participants

Correlation between Agentic ROI and reported usability

Value: r = 0.95

Domain-level ROI trend

Value: High (coding, research) vs. Low (office, e-commerce, personal)
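The reported r is a correlation coefficient between computed Agentic ROI and reported usability. The summary does not say which coefficient was used; assuming it is Pearson's r, a minimal sketch of the computation looks like this. The function name and inputs are hypothetical, and no survey data is reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_x * var_y)
```

To use it, pass one ROI value and one usability rating per domain (or per respondent). With only five domains, even a very high r rests on few data points, which fits the associative-not-causal caveat listed under Limitations.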

Who Should Care

What To Try In 7 Days

Run a small ROI audit: pick one high-T0 workflow, log T0 and T_agent, and collect user quality ratings.

Add simple proactive features (prefilled templates, intent inference) to cut interaction time and re-measure ROI.

Pilot sleep-time compute or cached retrieval for repetitive tasks to estimate cost savings.
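The ROI audit above can be sketched as a small log of per-task records plus a summary. The record fields, the rating scale, and the choice to fold prompting and verification time into T_agent are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    t0_minutes: float        # baseline time without the agent (T0)
    t_agent_minutes: float   # time with the agent, assumed to include
                             # prompting and verification
    quality_rating: float    # user rating on a scale of your choice
    cost_usd: float          # agent cost for this task


def summarize(records):
    """Aggregate logged tasks into a few audit metrics."""
    if not records:
        raise ValueError("no records logged")

    saved = [r.t0_minutes - r.t_agent_minutes for r in records]
    total_cost = sum(r.cost_usd for r in records)

    return {
        "mean_time_saved_min": mean(saved),
        "mean_quality": mean(r.quality_rating for r in records),
        # Share of tasks where the agent was slower than the baseline.
        "share_tasks_slower": sum(s < 0 for s in saved) / len(saved),
        "minutes_saved_per_usd": sum(saved) / total_cost if total_cost > 0 else None,
    }
```

Re-running `summarize` after adding a proactive feature or a cache gives a before-and-after comparison on the same workflow.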

Agent Features

Memory

  • sleep-time compute (offline refinement)
  • long-term memory / retrieval
  • state persistence

Planning

  • long-horizon reasoning
  • iterative simulation
  • task decomposition

Tool Use

  • API integration
  • tool orchestration
  • external verification calls

Frameworks

  • n8n
  • LangChain
  • AutoGen
  • MetaGPT

Is Agentic

true

Architectures

  • multi-agent
  • generalist-to-specialist pipeline

Collaboration

  • agent swarms
  • multi-agent coordination

Optimization Features

Token Efficiency

  • speculative decoding
  • context compression

Infra Optimization

  • use of inference-optimized stacks (e.g., vLLM, FlashAttention)
  • AI-specific hardware co-design

Model Optimization

  • knowledge distillation
  • quantization
  • pruning
  • speculative decoding

System Optimization

  • memory retrieval instead of regeneration
  • state persistence to avoid recomputation
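As a rough illustration of retrieving a stored result instead of regenerating it, the sketch below wraps an expensive generation call in an exact-match cache. The class name, the normalization rule, and the `generate` callable are hypothetical. A production memory layer would more likely use semantic retrieval and persistent storage.

```python
class ResultMemory:
    """Return a stored answer for a repeated query instead of regenerating it."""

    def __init__(self, generate):
        self._generate = generate  # expensive call, e.g. a model request
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query):
        # Normalize case and whitespace so trivially different queries match.
        return " ".join(query.lower().split())

    def answer(self, query):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]

        self.misses += 1
        result = self._generate(query)
        self._store[key] = result
        return result
```

The ratio of `hits` to total calls gives a first estimate of how much regeneration cost a repetitive workload could avoid.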

Training Optimization

  • specialization for sub-tasks
  • distillation from generalist to expert models

Inference Optimization

  • sleep-time compute precomputation
  • retrieval-based reasoning
  • hardware-software co-optimization

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Small empirical sample (34 survey responses) limits generalizability.
  • Cost estimates per task are coarse and normalized heuristically.
  • Survey is self-reported and domain selection is limited to five categories.
  • Correlation reported is associative, not causal.

When Not To Use

  • Short, single-step interactions where UI is faster (low T0 tasks).
  • Deterministic, repetitive processes best served by RPA or rule systems.
  • Sensitive settings where sleep-time compute raises privacy concerns without safeguards.

Failure Modes

  • Prompting and verification overhead can erase time savings, yielding negative ROI.
  • Agent hallucination or drift during long multi-step tasks causes extra verification.
  • High compute cost can make marginal accuracy gains uneconomical.
  • Inter-agent coordination overhead in swarms may reduce net benefit.

Core Entities

Models

  • GPT-5
  • Gemini-3
  • Qwen-3
  • DeepSeek-V3.2

Metrics

  • Agentic ROI
  • Information Gain
  • Time Savings
  • Cost
  • Usability (user ratings)

Benchmarks

  • GAIA
  • AndroidWorld
  • τ2-Bench
  • AI Index

Context Entities

Models

  • Gemini 3 Pro
  • ChatGPT Pulse

Metrics

  • r (correlation coefficient)

Benchmarks

  • AndroidWorld
  • GAIA