Overview
AWO is practical for production agents with repeatable subroutines; it requires careful domain rules and verification to avoid unsafe merges.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
AWO offers a comparatively low-effort win for production agents: bundling repeated multi-step API sequences into single deterministic calls cuts inference cost and latency, with up to 11.9% fewer LLM calls and up to ~15% fewer tokens on the evaluated workloads, while often improving success rates.
Who Should Care
Summary TLDR
This paper presents AWO, a practical tool-discovery pipeline that scans past agent traces, merges equivalent states, and compiles repeated multi-step tool sequences into deterministic "meta-tools." On two agent benchmarks the method cut LLM calls by up to 11.9%, reduced token/cost usage (up to ~15% tokens in one benchmark), and in many settings raised task success by a few percentage points. AWO depends on domain-aware merging rules and works best when workloads contain frequent, repeatable subroutines like login or session init.
Problem Statement
Agentic LLMs make many costly LLM calls as they alternate reasoning and tool calls. Repeated, routine multi-step sequences (e.g., login, search-and-open) cause redundant LLM inference, adding latency, cost, and failure risk. The goal is to automatically find and replace recurring tool-call chains with single deterministic meta-tools to cut reasoning overhead without losing flexibility.
Main Contribution
AWO: a framework that builds a merged state graph from execution traces and extracts repeated tool-call chains as meta-tools.
An algorithm for horizontal and vertical graph merging plus greedy meta-tool extraction tuned by a threshold.
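The two contributions above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes traces are lists of tool-call strings, stands in for domain-aware merging with a hypothetical `normalize` rule that drops call arguments, and uses a simple prefix graph in place of the full horizontal and vertical merging. The names `Node`, `build_graph`, and `extract_meta_tools` are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One state in the merged graph: a tool-call prefix shared by traces."""
    count: int = 0
    children: dict = field(default_factory=dict)


def normalize(call: str) -> str:
    # Hypothetical merging rule: drop arguments so "login(alice)" and
    # "login(bob)" map to the same state. Real rules are domain-specific.
    return call.split("(", 1)[0]


def build_graph(traces):
    """Merge traces into a prefix graph, counting visits per state."""
    root = Node()
    for trace in traces:
        node = root
        for call in trace:
            node = node.children.setdefault(normalize(call), Node())
            node.count += 1
    return root


def extract_meta_tools(root, threshold, min_len=2):
    """Greedily follow the most frequent child while its count >= threshold."""
    meta_tools = []
    for first, node in root.children.items():
        if node.count < threshold:
            continue
        chain = [first]
        while node.children:
            key, nxt = max(node.children.items(), key=lambda kv: kv[1].count)
            if nxt.count < threshold:
                break
            chain.append(key)
            node = nxt
        if len(chain) >= min_len:
            meta_tools.append((tuple(chain), node.count))
    return meta_tools


traces = [
    ["login(alice)", "open_inbox()", "search(x)"],
    ["login(bob)", "open_inbox()", "send(y)"],
    ["login(carol)", "open_inbox()", "search(z)"],
    ["search(q)"],
]
graph = build_graph(traces)
candidates = extract_meta_tools(graph, threshold=3)
```

With a threshold of 3, only the `login` then `open_inbox` chain survives in this toy data; lowering the threshold admits longer but less frequent chains, which is the trade-off the threshold tunes.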
Key Findings
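AWO reduced the number of LLM calls on the evaluated benchmarks, by up to 11.9% (3665 → 3229 total calls on APPWORLD with GPT 5.1).
Total token usage and monetary cost decreased when meta-tools were used, with up to 14.9% fewer tokens on APPWORLD.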
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| LLM call reduction | up to 11.9% fewer calls | base toolset | -11.9% | APPWORLD (GPT 5.1) | Total LLM call count 3665 → 3229 (Table 9) | Table 9 |
| Token usage reduction | up to 14.9% fewer tokens | base toolset | -14.9% | APPWORLD (GPT 5.1) | Total tokens 34.1M → 29.0M (Table 1, Table 11) | Table 11 |
What To Try In 7 Days
Collect recent agent execution traces and inspect frequent prefixes.
Build a simple state graph and find the top 2–5 repeated prefixes.
Implement 1–3 meta-tools (e.g., auto-login) and run A/B tests measuring LLM calls, tokens, and success rate for a week of traffic.
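For the A/B step, a small comparison helper may be enough to start. The sketch below assumes each arm is summarized by a hypothetical `RunStats` record (task count, successes, LLM calls, tokens) and normalizes per task so that arms of different sizes stay comparable; the field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregated metrics for one arm of the A/B test (hypothetical schema)."""
    tasks: int
    successes: int
    llm_calls: int
    tokens: int


def relative_change(baseline: float, treatment: float) -> float:
    """Percent change from baseline; negative means a reduction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treatment - baseline) / baseline * 100.0


def compare(control: RunStats, treated: RunStats) -> dict:
    """Per-task deltas for calls and tokens, plus success-rate change in points."""
    return {
        "llm_calls_pct": relative_change(
            control.llm_calls / control.tasks, treated.llm_calls / treated.tasks
        ),
        "tokens_pct": relative_change(
            control.tokens / control.tasks, treated.tokens / treated.tasks
        ),
        "success_rate_pts": (
            treated.successes / treated.tasks - control.successes / control.tasks
        ) * 100.0,
    }


# Made-up numbers for illustration only.
control = RunStats(tasks=100, successes=80, llm_calls=1000, tokens=500_000)
treated = RunStats(tasks=100, successes=83, llm_calls=900, tokens=450_000)
report = compare(control, treated)
```

As a sanity check, applying `relative_change` to the call totals in the Results table (3665 → 3229) gives roughly -11.9%, matching the reported delta.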
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Horizontal merging needs domain expertise; unsafe merges can alter semantics.
Not all merged candidates are convertible to meta-tools due to input complexity.
When Not To Use
Workloads with highly dynamic, one-off tool sequences and no repeated prefixes.
Tools with complex or unsafe side effects that cannot be composed deterministically.
Failure Modes
Applying a meta-tool that omits necessary intermediate checks, causing incorrect side effects.
Automated regex/rule generation introducing bad merges that reduce accuracy.
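One way to guard against the first failure mode is to keep the intermediate checks inside the meta-tool and abort when one fails, so the agent can fall back to step-by-step tool calls. The sketch below is an assumption about how such a wrapper might look, not something the paper prescribes; `make_meta_tool`, `MetaToolAborted`, and the toy login steps are hypothetical.

```python
from typing import Any, Callable, Optional

Step = Callable[[Any], Any]
Check = Callable[[Any], bool]


class MetaToolAborted(Exception):
    """Raised when an intermediate check fails; the caller should fall back."""


def make_meta_tool(
    steps: list[tuple[str, Step, Optional[Check]]],
) -> Callable[[Any], Any]:
    """Compose steps into one deterministic call, validating between steps."""
    def run(initial: Any) -> Any:
        value = initial
        for name, step, check in steps:
            value = step(value)
            # Stop before the next step if this result looks wrong.
            if check is not None and not check(value):
                raise MetaToolAborted(f"check failed after step {name!r}")
        return value
    return run


# Toy steps standing in for real tool calls.
def login(user: str) -> dict:
    return {"user": user, "token": "t" if user else None}


def open_inbox(session: dict) -> dict:
    return {**session, "inbox": []}


auto_login = make_meta_tool([
    ("login", login, lambda r: r["token"] is not None),
    ("open_inbox", open_inbox, None),
])
result = auto_login("alice")
```

Note that aborting does not undo side effects of steps that already ran, so this pattern suits chains whose early steps are safe to repeat or leave in place.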

