Bundle repeated multi-step tool calls into deterministic 'meta-tools' to cut LLM calls, cost, and failures.

January 29, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

AWO is practical for production agents with repeatable subroutines; it requires careful domain rules and verification to avoid unsafe merges.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, Martijn de Vos

Why It Matters For Business

AWO provides a low-effort win for production agents: bundle repeated multi-step API sequences into single deterministic calls to cut inference cost and latency by ~5–15% on evaluated workloads while often improving success rates.

Who Should Care

Summary TLDR

This paper presents AWO, a practical tool-discovery pipeline that scans past agent traces, merges equivalent states, and compiles repeated multi-step tool sequences into deterministic "meta-tools." Across two agent benchmarks, the method cut LLM calls by up to 11.9% and token usage and cost by up to ~15% (both on APPWORLD), and in many settings raised task success by a few percentage points. AWO depends on domain-aware merging rules and works best when workloads contain frequent, repeatable subroutines such as login or session initialization.

Problem Statement

Agentic LLMs make many costly LLM calls as they alternate reasoning and tool calls. Repeated, routine multi-step sequences (e.g., login, search-and-open) cause redundant LLM inference, adding latency, cost, and failure risk. The goal is to automatically find and replace recurring tool-call chains with single deterministic meta-tools to cut reasoning overhead without losing flexibility.

Main Contribution

AWO: a framework that builds a merged state graph from execution traces and extracts repeated tool-call chains as meta-tools.

An algorithm for horizontal and vertical graph merging plus greedy meta-tool extraction tuned by a threshold.
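
To make the idea concrete, here is a minimal sketch of trace merging followed by greedy, threshold-based extraction. It is not the authors' implementation: the regex rules, the function names (`normalize`, `count_prefixes`, `extract_meta_tools`), and the restriction to shared prefixes are all assumptions made for illustration. The normalization step stands in for rule-based merging of equivalent states, and the prefix counter plays the role of the merged state graph.

```python
import re
from collections import Counter

# Hypothetical domain rules: calls that differ only in argument values
# are rewritten to one canonical state so they can be merged.
NORMALIZATION_RULES = [
    (re.compile(r"^login\(.*\)$"), "login(*)"),
    (re.compile(r"^open_page\(.*\)$"), "open_page(*)"),
]


def normalize(call: str) -> str:
    """Map a raw tool call to its canonical state (first matching rule wins)."""
    for pattern, canonical in NORMALIZATION_RULES:
        if pattern.match(call):
            return canonical
    return call


def count_prefixes(traces):
    """Count how many traces pass through each normalized prefix.

    Each distinct prefix corresponds to one node of a prefix tree built
    from the merged traces.
    """
    counts = Counter()
    for trace in traces:
        states = tuple(normalize(call) for call in trace)
        for end in range(1, len(states) + 1):
            counts[states[:end]] += 1
    return counts


def extract_meta_tools(traces, threshold, min_length=2):
    """Greedily keep the longest prefixes shared by at least `threshold` traces."""
    counts = count_prefixes(traces)
    candidates = [
        prefix for prefix, n in counts.items()
        if n >= threshold and len(prefix) >= min_length
    ]
    candidates.sort(key=len, reverse=True)  # longest chains first
    selected = []
    for cand in candidates:
        # Skip a chain that is already covered by a longer selected chain.
        if any(chain[:len(cand)] == cand for chain in selected):
            continue
        selected.append(cand)
    return selected


# Toy traces with invented tool names.
traces = [
    ["login(alice)", "get_token()", "list_orders()"],
    ["login(bob)", "get_token()", "search(shoes)"],
    ["login(carol)", "get_token()", "list_orders()"],
    ["search(hats)"],
]
print(extract_meta_tools(traces, threshold=3))
```

With `threshold=3` the only chain that survives is `login(*)` followed by `get_token()`; lowering the threshold to 2 selects the longer chain that also ends in `list_orders()`. The threshold therefore trades coverage (how many traces can use the meta-tool) against chain length (how many LLM steps each use saves).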

Key Findings

AWO reduces the number of LLM calls on evaluated benchmarks.

Numbers: LLM calls reduced up to 11.9% (APPWORLD, GPT 5.1)

Practical Use: Add meta-tools for repeated subroutines to cut inference calls and lower billing and latency on similar agent workloads.

Evidence Ref: Table 9; Section 5.1

Total token usage and monetary cost decreased when meta-tools are used.

Numbers: Total tokens −14.9% and cost −15.0% (APPWORLD, GPT 5.1)

Practical Use: Expect measurable billing reductions when your workload has many redundant LLM steps, since removing steps reduces expensive output tokens.

Evidence Ref: Table 1 and Table 11

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| LLM call reduction | up to 11.9% fewer calls | base toolset | -11.9% | APPWORLD (GPT 5.1) | Total LLM call count 3665 → 3229 (Table 9) | Table 9 |
| Token usage reduction | up to 14.9% fewer tokens | base toolset | -14.9% | APPWORLD (GPT 5.1) | Total tokens 34.1M → 29.0M (Table 1, Table 11) | Table 11 |
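
The deltas can be sanity-checked from the raw totals quoted in the Evidence column. The helper below is an illustrative sketch (the name `relative_reduction` is an assumption, not something from the paper) that computes the reduction relative to the baseline.

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Return the percentage reduction of `optimized` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


# Totals as quoted in the results for APPWORLD (GPT 5.1).
call_reduction = relative_reduction(3665, 3229)
token_reduction = relative_reduction(34.1e6, 29.0e6)

print(f"LLM calls: -{call_reduction:.1f}%")
print(f"Tokens:    -{token_reduction:.1f}%")
```

The call counts reproduce the reported 11.9%. The token totals are quoted only to three significant figures, so they yield roughly 15.0% rather than exactly 14.9%; the gap is presumably rounding in the published totals.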

What To Try In 7 Days

Collect recent agent execution traces and inspect frequent prefixes.

Build a simple state graph and find the top 2–5 repeated prefixes.

Implement 1–3 meta-tools (e.g., auto-login) and run A/B tests measuring LLM calls, tokens, and success rate for a week of traffic.
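
As a starting point for the third step, a meta-tool can be as simple as one function that runs the bundled calls in a fixed order and keeps the intermediate checks the agent would otherwise perform between steps. The sketch below is hypothetical throughout: `FakeAppClient`, its `login` and `get_token` methods, and `MetaToolError` are invented stand-ins for a real tool backend, not part of AWO or of any benchmark API.

```python
from dataclasses import dataclass, field


class MetaToolError(RuntimeError):
    """Raised when a bundled step fails, so the agent can take over again."""


@dataclass
class FakeAppClient:
    """Hypothetical in-memory stand-in for a real tool backend."""
    users: dict = field(default_factory=lambda: {"alice": "s3cret"})
    calls: list = field(default_factory=list)

    def login(self, user: str, password: str) -> dict:
        self.calls.append("login")
        return {"ok": self.users.get(user) == password}

    def get_token(self, user: str) -> dict:
        self.calls.append("get_token")
        return {"token": f"tok-{user}"}


def auto_login(client: FakeAppClient, user: str, password: str) -> str:
    """Meta-tool: run login and get_token as one deterministic call.

    Each intermediate result is checked before the next step runs, so a
    failed login never triggers the token request.
    """
    result = client.login(user, password)
    if not result.get("ok"):
        raise MetaToolError("login failed; returning control to the agent")
    token = client.get_token(user).get("token")
    if not token:
        raise MetaToolError("no token returned")
    return token


client = FakeAppClient()
print(auto_login(client, "alice", "s3cret"))
print(client.calls)
```

Exposing `auto_login` to the agent as a single tool replaces two LLM-mediated steps with one. Raising an explicit error on a failed check, rather than continuing, is one way to avoid the side-effect failure mode listed under Risks & Boundaries.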

Agent Features

Memory
execution traces, short-term state graph
Planning
iterative reasoning, stepwise action planning
Tool Use
function calling, meta-tools (composite tools), tool merging
Frameworks
AWO, ReAct
Is Agentic

Yes

Architectures
ReAct loop, tool-calling agent

Optimization Features

Token Efficiency
remove entire output-token-heavy steps by bundling, fewer new generations per task
Infra Optimization
lower LLM billing and end-to-end latency
System Optimization
state-graph merging and rule-based compression, deterministic composite tool execution
Inference Optimization
reduce total LLM invocations via meta-tools, shorter trajectories reduce chance of hallucination

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Horizontal merging needs domain expertise; unsafe merges can alter semantics.

Not all merged candidates are convertible to meta-tools due to input complexity.

When Not To Use

Workloads with highly dynamic, one-off tool sequences and no repeated prefixes.

Tools with complex or unsafe side effects that cannot be composed deterministically.

Failure Modes

Applying a meta-tool that omits necessary intermediate checks, which can cause incorrect side effects.

Automated regex/rule generation that introduces bad merges and reduces accuracy.

Core Entities

Models

GPT 5.1, Claude Sonnet 4.5, GPT-OSS 120B

Metrics

LLM call count, Token usage, Monetary cost, Task success rate, Meta-tool utilisation, Execution steps

Datasets

VISUALWEBARENA, APPWORLD

Benchmarks

VISUALWEBARENA, APPWORLD