Bundle repeated multi-step tool calls into deterministic 'meta-tools' to cut LLM calls, cost, and failures.

Overview

Decision Snapshot: Ready For Pilot

AWO is practical for production agents with repeatable subroutines; it requires careful domain rules and verification to avoid unsafe merges.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, Martijn de Vos

Links

Abstract / PDF

Why It Matters For Business

AWO provides a low-effort win for production agents: bundle repeated multi-step API sequences into single deterministic calls to cut inference cost and latency by ~5–15% on evaluated workloads while often improving success rates.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper presents AWO, a practical tool-discovery pipeline that scans past agent traces, merges equivalent states, and compiles repeated multi-step tool sequences into deterministic "meta-tools." On two agent benchmarks the method cut LLM calls by up to 11.9%, reduced token/cost usage (up to ~15% tokens in one benchmark), and in many settings raised task success by a few percentage points. AWO depends on domain-aware merging rules and works best when workloads contain frequent, repeatable subroutines like login or session init.

Problem Statement

Agentic LLMs make many costly LLM calls as they alternate reasoning and tool calls. Repeated, routine multi-step sequences (e.g., login, search-and-open) cause redundant LLM inference, adding latency, cost, and failure risk. The goal is to automatically find and replace recurring tool-call chains with single deterministic meta-tools to cut reasoning overhead without losing flexibility.

Main Contribution

AWO: a framework that builds a merged state graph from execution traces and extracts repeated tool-call chains as meta-tools.

An algorithm for horizontal and vertical graph merging plus greedy meta-tool extraction tuned by a threshold.
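A minimal sketch of the extraction idea, not the authors' implementation. It assumes each trace is a plain list of tool-call names, uses a prefix tree with visit counts as a stand-in for the merged state graph (the paper's horizontal and vertical merging rules are not reproduced here), and greedily follows the most frequent next call while it keeps at least `threshold` of its parent's visits. The names `Node`, `build_prefix_tree`, `extract_meta_tool`, `threshold`, and `min_len` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    count: int = 0                                  # traces passing through this state
    children: dict = field(default_factory=dict)   # tool-call name -> Node


def build_prefix_tree(traces):
    """Merge traces that share a prefix into shared nodes with visit counts."""
    root = Node()
    for trace in traces:
        root.count += 1
        node = root
        for call in trace:
            node = node.children.setdefault(call, Node())
            node.count += 1
    return root


def extract_meta_tool(root, threshold=0.8, min_len=2):
    """Greedily extend a chain while the top child keeps >= threshold of its parent."""
    chain, node = [], root
    while node.children:
        call, child = max(node.children.items(), key=lambda kv: kv[1].count)
        if child.count < threshold * node.count:
            break                                   # traces diverge too much here
        chain.append(call)
        node = child
    return chain if len(chain) >= min_len else []


# Illustrative traces only (hypothetical tool names).
traces = [
    ["open_app", "login", "list_items", "buy"],
    ["open_app", "login", "list_items", "refund"],
    ["open_app", "login", "search"],
    ["open_app", "login", "list_items", "buy"],
    ["open_app", "login", "search"],
]
meta_tool = extract_meta_tool(build_prefix_tree(traces), threshold=0.8)
print(meta_tool)  # ['open_app', 'login']
```

A lower threshold yields longer but less universally applicable chains; with `threshold=0.5` the same traces produce a four-step chain that only some tasks can use.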

Key Findings

AWO reduces the number of LLM calls on evaluated benchmarks.

Numbers: LLM calls reduced up to 11.9% (APPWORLD, GPT 5.1)

Practical Use: Add meta-tools for repeated subroutines to cut inference calls and lower billing and latency on similar agent workloads.

Evidence Ref: Table 9; Section 5.1

Total token usage and monetary cost decreased when meta-tools are used.

Numbers: Total tokens −14.9% and cost −15.0% (APPWORLD, GPT 5.1)

Practical Use: Expect measurable billing reductions when your workload has many redundant LLM steps, since removing steps reduces expensive output tokens.

Evidence Ref: Table 1 and Table 11

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLM call reduction	up to 11.9% fewer calls	base toolset	-11.9%	APPWORLD (GPT 5.1)	Total LLM call count 3665 → 3229 (Table 9)	Table 9
Token usage reduction	up to 14.9% fewer tokens	base toolset	-14.9%	APPWORLD (GPT 5.1)	Total tokens 34.1M → 29.0M (Table 1, Table 11)	Table 11
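The deltas in the table can be recomputed from the before/after totals it lists. The sketch below only uses those totals; `relative_change` is a hypothetical helper name. Note that the rounded token totals (34.1M and 29.0M) give roughly −15.0% rather than the reported −14.9%, presumably because the totals shown here are themselves rounded.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before * 100


calls_delta = relative_change(3665, 3229)        # total LLM calls, Table 9
tokens_delta = relative_change(34.1e6, 29.0e6)   # total tokens, rounded totals

print(f"LLM calls: {calls_delta:.1f}%")   # LLM calls: -11.9%
print(f"Tokens:    {tokens_delta:.1f}%")  # Tokens:    -15.0%
```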

What To Try In 7 Days

Collect recent agent execution traces and inspect frequent prefixes.

Build a simple state graph and find the top 2–5 repeated prefixes.

Implement 1–3 meta-tools (e.g., auto-login) and run A/B tests measuring LLM calls, tokens, and success rate for a week of traffic.
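For the A/B step, a small comparison helper may be enough to start with. This is a sketch under assumptions: each run is logged as a dict with `llm_calls`, `tokens`, and `success` fields (a hypothetical log schema), and the sample numbers are placeholders, not measurements.

```python
from statistics import mean


def summarize(runs):
    """Average the three metrics over a list of per-task run records."""
    return {
        "llm_calls": mean(r["llm_calls"] for r in runs),
        "tokens": mean(r["tokens"] for r in runs),
        "success_rate": mean(1.0 if r["success"] else 0.0 for r in runs),
    }


def pct_change(before, after):
    """Relative change in percent; None when the baseline is zero."""
    return None if before == 0 else (after - before) / before * 100


def compare(control, treatment):
    """Relative change for calls and tokens, percentage points for success."""
    base, new = summarize(control), summarize(treatment)
    return {
        "llm_calls_pct": pct_change(base["llm_calls"], new["llm_calls"]),
        "tokens_pct": pct_change(base["tokens"], new["tokens"]),
        "success_points": (new["success_rate"] - base["success_rate"]) * 100,
    }


# Placeholder records for illustration only.
control = [
    {"llm_calls": 10, "tokens": 4000, "success": True},
    {"llm_calls": 12, "tokens": 5000, "success": False},
]
treatment = [
    {"llm_calls": 9, "tokens": 3600, "success": True},
    {"llm_calls": 10, "tokens": 4100, "success": True},
]
report = compare(control, treatment)
```

With a week of real traffic, each arm would hold many more records, and a significance test on top of these averages would be advisable before drawing conclusions.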

Agent Features

Memory

execution traces, short-term state graph

Planning

iterative reasoning, stepwise action planning

Tool Use

function calling, meta-tools (composite tools), tool merging

Frameworks

AWO, ReAct

Is Agentic

Yes

Architectures

ReAct loop, tool-calling agent

Optimization Features

Token Efficiency

remove entire output-token-heavy steps by bundling; fewer new generations per task

Infra Optimization

lower LLM billing and end-to-end latency

System Optimization

state-graph merging and rule-based compression; deterministic composite tool execution

Inference Optimization

reduce total LLM invocations via meta-tools; shorter trajectories reduce chance of hallucination

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Risks & Boundaries

Limitations

Horizontal merging needs domain expertise; unsafe merges can alter semantics.

Not all merged candidates are convertible to meta-tools due to input complexity.
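One way to picture a domain merging rule and how it can go wrong. This is an illustration only: it assumes a rule is a regex that rewrites volatile arguments to placeholders so that equivalent calls compare equal. The rules, call strings, and the `normalize` helper are hypothetical and not taken from the paper.

```python
import re

# Narrow rules: only arguments known to be volatile are replaced.
SAFE_RULES = [
    (re.compile(r"session_id='[^']*'"), "session_id='<SESSION>'"),
    (re.compile(r"token='[^']*'"), "token='<TOKEN>'"),
]

# Over-broad rule: every quoted argument is replaced.
UNSAFE_RULES = [(re.compile(r"'[^']*'"), "'<ARG>'")]


def normalize(call, rules):
    """Apply each (pattern, replacement) rule to a tool-call string."""
    for pattern, replacement in rules:
        call = pattern.sub(replacement, call)
    return call


# Equivalent calls that differ only in session id merge under the safe rules.
a = "get_cart(session_id='abc123')"
b = "get_cart(session_id='zz9')"
same_state = normalize(a, SAFE_RULES) == normalize(b, SAFE_RULES)          # True

# Calls with different effects stay distinct under the safe rules...
c = "delete_file(path='/tmp/cache')"
d = "delete_file(path='/home/user/report')"
kept_apart = normalize(c, SAFE_RULES) != normalize(d, SAFE_RULES)          # True

# ...but the over-broad rule merges them, which changes semantics.
wrongly_merged = normalize(c, UNSAFE_RULES) == normalize(d, UNSAFE_RULES)  # True
```

Reviewing which arguments a rule erases, and keeping rules as narrow as possible, is one plausible form the required domain expertise can take.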

When Not To Use

Workloads with highly dynamic, one-off tool sequences and no repeated prefixes.

Tools with complex or unsafe side effects that cannot be composed deterministically.

Failure Modes

Applying a meta-tool that omits necessary intermediate checks, causing incorrect side effects.

Automated regex/rule generation introducing bad merges that reduce accuracy.
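A possible guard against the first failure mode is to keep the intermediate checks inside the meta-tool and stop at the first one that fails, so the agent can fall back to step-by-step tool calls. The sketch below is hypothetical: the step functions are stubs standing in for real tools, and `run_meta_tool`, `StepFailed`, and `AUTO_LOGIN` are illustrative names.

```python
class StepFailed(Exception):
    """Raised when an intermediate check fails, so the caller can fall back."""


def run_meta_tool(steps, context):
    """Run (name, action, check) steps in order; abort on the first failed check."""
    for name, action, check in steps:
        result = action(context)
        if not check(result):
            raise StepFailed(name)      # later steps never run
        context[name] = result
    return context


# Stub tools for illustration; real tools would call the application's API.
def open_login_page(ctx):
    return {"status": 200}


def submit_credentials(ctx):
    return {"status": 200 if ctx.get("password") == "correct" else 401}


def fetch_profile(ctx):
    ctx["profile_fetched"] = True
    return {"status": 200, "user": ctx["user"]}


def status_ok(result):
    return result["status"] == 200


AUTO_LOGIN = [
    ("open", open_login_page, status_ok),
    ("submit", submit_credentials, status_ok),
    ("profile", fetch_profile, status_ok),
]

good = run_meta_tool(AUTO_LOGIN, {"user": "alice", "password": "correct"})

bad_ctx = {"user": "bob", "password": "wrong"}
try:
    run_meta_tool(AUTO_LOGIN, bad_ctx)
    failed_step = None
except StepFailed as exc:
    failed_step = exc.args[0]           # "submit"
```

The execution stays deterministic, but a failed check surfaces as an explicit error instead of a silent incorrect side effect.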

Core Entities

Models

GPT 5.1, Claude Sonnet 4.5, GPT-OSS 120B

Metrics

LLM call count, Token usage, Monetary cost, Task success rate, Meta-tool utilisation, Execution steps

Datasets

VISUALWEBARENA, APPWORLD

Benchmarks

VISUALWEBARENA, APPWORLD

