Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
AWO offers a low-effort win for production agents: bundling repeated multi-step API sequences into single deterministic calls cuts inference cost and latency by roughly 5–15% on the evaluated workloads, and often improves success rates as well.
Summary TLDR
This paper presents AWO, a practical tool-discovery pipeline that scans past agent traces, merges equivalent states, and compiles repeated multi-step tool sequences into deterministic "meta-tools." On two agent benchmarks, the method cut LLM calls by up to 11.9%, reduced token usage and cost (tokens by up to ~15% on one benchmark), and in many settings raised task success by a few percentage points. AWO depends on domain-aware merging rules and works best when workloads contain frequent, repeatable subroutines such as login or session initialization.
Problem Statement
LLM agents make many costly model calls as they alternate between reasoning and tool use. Repeated, routine multi-step sequences (e.g., login, search-and-open) trigger redundant inference, adding latency, cost, and failure risk. The goal is to automatically find recurring tool-call chains and replace them with single deterministic meta-tools, cutting reasoning overhead without losing flexibility.
Main Contribution
AWO: a framework that builds a merged state graph from execution traces and extracts repeated tool-call chains as meta-tools.
An algorithm for horizontal and vertical graph merging plus greedy meta-tool extraction tuned by a threshold.
Empirical evaluation on VISUALWEBARENA and APPWORLD showing up to 11.9% fewer LLM calls, up to ~15% token cost savings, and small gains in task success.
Open-source release of AWO to enable reproduction and adoption.
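The graph-building and greedy extraction steps listed above can be illustrated with a minimal sketch. This is not the released AWO code. It assumes each trace is a plain list of tool-call names, models only the merging of identical prefixes (no domain-specific horizontal rules), and interprets the threshold as the minimum share of all traces that must pass through a state. The names Node, build_prefix_graph, and extract_chain are hypothetical.

```python
class Node:
    """One merged state: every trace sharing the same call prefix lands here."""
    def __init__(self):
        self.count = 0      # number of traces passing through this state
        self.children = {}  # next tool call -> Node


def build_prefix_graph(traces):
    """Merge traces that share a prefix into a single counted tree."""
    root = Node()
    for trace in traces:
        root.count += 1
        node = root
        for call in trace:
            node = node.children.setdefault(call, Node())
            node.count += 1
    return root


def extract_chain(root, threshold):
    """Greedily follow the most frequent next call while it still covers
    at least `threshold` (a fraction) of all traces. Chains shorter than
    two calls are not worth bundling, so they are dropped."""
    chain, node = [], root
    while node.children:
        call, child = max(node.children.items(), key=lambda kv: kv[1].count)
        if child.count < threshold * root.count:
            break
        chain.append(call)
        node = child
    return chain if len(chain) >= 2 else []


# Illustrative traces (made up for this sketch).
traces = [
    ["login", "open_inbox", "search", "reply"],
    ["login", "open_inbox", "read"],
    ["login", "open_inbox", "search", "archive"],
    ["login", "list_files"],
]
candidate = extract_chain(build_prefix_graph(traces), threshold=0.6)
# candidate is ["login", "open_inbox"]: 3 of 4 traces share that prefix,
# while "search" is reached by only 2 of 4 and falls below the threshold.
```

A lower threshold yields longer but less widely applicable chains; a higher one yields shorter chains that nearly every task can reuse.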
Key Findings
AWO reduces the number of LLM calls on evaluated benchmarks.
Total token usage and monetary cost decrease when meta-tools are used.
Task success often improves after adding meta-tools.
Meta-tool applicability depends on workload structure.
Horizontal merging requires domain knowledge and careful validation.
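To make the dependence on domain knowledge concrete, here is a hedged sketch of what a rule-based horizontal merge could look like. It assumes states are represented by tool-call strings and that a domain expert supplies regex rules that erase details irrelevant to the task, such as session tokens or record IDs. The rules, the call format, and the function names are all hypothetical and are not taken from AWO.

```python
import re

# Hypothetical domain rules: each maps a volatile detail to a placeholder.
# A wrong rule here would merge states that are not truly equivalent.
RULES = [
    (re.compile(r"session_id=[A-Za-z0-9]+"), "session_id=<SESSION>"),
    (re.compile(r"/orders/\d+"), "/orders/<ID>"),
]


def normalize(call: str) -> str:
    """Apply every rule in order so equivalent calls share one key."""
    for pattern, replacement in RULES:
        call = pattern.sub(replacement, call)
    return call


def merge_equivalent(calls):
    """Group raw calls by their normalized form."""
    groups = {}
    for call in calls:
        groups.setdefault(normalize(call), []).append(call)
    return groups


calls = [
    "GET /orders/17?session_id=abc123",
    "GET /orders/42?session_id=zz9",
    "GET /cart?session_id=abc123",
]
groups = merge_equivalent(calls)
# The two order lookups collapse into one state; the cart call stays separate.
```

Reviewing the raw members of each group before accepting a rule is one simple way to catch merges that would alter semantics.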
Results
LLM call reduction
Token usage reduction
Monetary cost reduction
Task success rate change
Meta-tool utilisation
Who Should Care
What To Try In 7 Days
Collect recent agent execution traces and inspect frequent prefixes.
Build a simple state graph and find the top 2–5 repeated prefixes.
Implement 1–3 meta-tools (e.g., auto-login) and run A/B tests measuring LLM calls, tokens, and success rate for a week of traffic.
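For the A/B step, a small comparison script is usually enough. The sketch below assumes each task run is logged with its LLM call count, token count, and a success flag; the Run record and both function names are hypothetical, and the numbers are made up for illustration. It reports relative change for calls and tokens and an absolute change for success rate, and it assumes the control arm has nonzero calls and tokens.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    llm_calls: int
    tokens: int
    success: bool


def summarize(runs):
    """Average the per-task metrics for one arm of the experiment."""
    return {
        "llm_calls": mean(r.llm_calls for r in runs),
        "tokens": mean(r.tokens for r in runs),
        "success_rate": mean(1.0 if r.success else 0.0 for r in runs),
    }


def compare(control, treatment):
    """Relative change for cost metrics, absolute change for success rate."""
    base, new = summarize(control), summarize(treatment)
    return {
        "llm_calls_rel": (new["llm_calls"] - base["llm_calls"]) / base["llm_calls"],
        "tokens_rel": (new["tokens"] - base["tokens"]) / base["tokens"],
        "success_rate_abs": new["success_rate"] - base["success_rate"],
    }


# Made-up logs: control without meta-tools, treatment with them.
control = [Run(10, 1000, True), Run(10, 1000, False)]
treatment = [Run(8, 800, True), Run(8, 800, True)]
report = compare(control, treatment)
```

Negative values for the two cost metrics mean the meta-tool arm used fewer calls or tokens. With a week of traffic, it is worth checking that both arms saw a similar task mix before reading much into the differences.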
Agent Features
Memory
- execution traces
- short-term state graph
Planning
- iterative reasoning
- stepwise action planning
Tool Use
- function calling
- meta-tools (composite tools)
- tool merging
Frameworks
- AWO
- ReAct
Is Agentic
true
Architectures
- ReAct loop
- tool-calling agent
Optimization Features
Token Efficiency
- remove entire output-token-heavy steps by bundling them into meta-tools
- fewer new generations per task
Infra Optimization
- lower LLM billing and end-to-end latency
System Optimization
- state-graph merging and rule-based compression
- deterministic composite tool execution
Inference Optimization
- reduce total LLM invocations via meta-tools
- shorter trajectories reduce chance of hallucination
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Horizontal merging needs domain expertise; unsafe merges can alter semantics.
- Not all merged candidates are convertible to meta-tools due to input complexity.
- Automated rule discovery is preliminary and can produce noisy or harmful rules.
When Not To Use
- Workloads with highly dynamic, one-off tool sequences and no repeated prefixes.
- Tools with complex or unsafe side effects that cannot be composed deterministically.
- When you lack execution traces representative of future traffic.
Failure Modes
- Applying a meta-tool that omits necessary intermediate checks, causing incorrect side effects.
- Automated regex/rule generation introducing bad merges that reduce accuracy.
- LLM randomness causing occasional drops in success despite meta-tools (observed for some settings).
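One way to guard against the first failure mode is to keep an explicit check after every bundled step and abort as soon as one fails, so the agent can fall back to step-by-step reasoning. The sketch below is an assumption about how such a wrapper might be written, not AWO's implementation. make_meta_tool, MetaToolAborted, and the two stub tools are hypothetical.

```python
class MetaToolAborted(Exception):
    """Raised when an intermediate check fails; the caller should fall back."""


def make_meta_tool(name, steps):
    """Compose (tool, check) pairs into one deterministic callable.

    Each tool receives a shared context dict; each check receives that
    tool's result and returns True when it is safe to continue.
    """
    def meta_tool(ctx):
        results = []
        for index, (tool, check) in enumerate(steps):
            result = tool(ctx)
            if not check(result):
                raise MetaToolAborted(f"{name}: step {index} failed its check")
            results.append(result)
        return results

    meta_tool.__name__ = name
    return meta_tool


# Stub tools standing in for real API calls (hypothetical).
def open_login_page(ctx):
    return {"status": 200}


def submit_credentials(ctx):
    return {"status": 200 if ctx.get("password") == "secret" else 401}


def is_ok(result):
    return result["status"] == 200


auto_login = make_meta_tool(
    "auto_login",
    [(open_login_page, is_ok), (submit_credentials, is_ok)],
)
```

Calling auto_login with a valid context returns the list of step results; with a bad password it raises MetaToolAborted after the second step, and no later step in a longer chain would run.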
Core Entities
Models
- GPT 5.1
- Claude Sonnet 4.5
- GPT-OSS 120B
Metrics
- LLM call count
- Token usage
- Monetary cost
- Task success rate
- Meta-tool utilisation
- Execution steps
Datasets
- VISUALWEBARENA
- APPWORLD
Benchmarks
- VISUALWEBARENA
- APPWORLD

