Bundle repeated multi-step tool calls into deterministic 'meta-tools' to cut LLM calls, cost, and failures.

January 29, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, Martijn de Vos

Why It Matters For Business

AWO provides a low-effort win for production agents: bundle repeated multi-step API sequences into single deterministic calls to cut inference cost and latency by ~5–15% on evaluated workloads while often improving success rates.

Summary TLDR

This paper presents AWO, a practical tool-discovery pipeline that scans past agent traces, merges equivalent states, and compiles repeated multi-step tool sequences into deterministic "meta-tools." On two agent benchmarks the method cut LLM calls by up to 11.9%, reduced token/cost usage (up to ~15% tokens in one benchmark), and in many settings raised task success by a few percentage points. AWO depends on domain-aware merging rules and works best when workloads contain frequent, repeatable subroutines like login or session init.

Problem Statement

Agentic LLMs make many costly LLM calls as they alternate reasoning and tool calls. Repeated, routine multi-step sequences (e.g., login, search-and-open) cause redundant LLM inference, adding latency, cost, and failure risk. The goal is to automatically find and replace recurring tool-call chains with single deterministic meta-tools to cut reasoning overhead without losing flexibility.
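
To make the idea of a meta-tool concrete, here is a minimal sketch of bundling a fixed tool sequence into one deterministic callable. It is an illustration only, not AWO's implementation: `make_meta_tool`, `open_login_page`, `submit_credentials`, and the dictionary-based context are all hypothetical names and conventions chosen for this example.

```python
from typing import Any, Callable, Dict, List

# Assumption: a tool takes a context dict and returns an updated copy.
Context = Dict[str, Any]
Tool = Callable[[Context], Context]


def make_meta_tool(name: str, steps: List[Tool]) -> Tool:
    """Bundle a fixed sequence of tools into one deterministic callable."""
    def meta_tool(context: Context) -> Context:
        for step in steps:
            context = step(context)  # no LLM call happens between steps
        return context
    meta_tool.__name__ = name
    return meta_tool


# Hypothetical stand-ins for real tools, for illustration only.
def open_login_page(ctx: Context) -> Context:
    return {**ctx, "page": "login"}


def submit_credentials(ctx: Context) -> Context:
    if ctx.get("page") != "login":
        raise RuntimeError("not on the login page")
    return {**ctx, "session": f"session-for-{ctx['user']}", "page": "home"}


auto_login = make_meta_tool("auto_login", [open_login_page, submit_credentials])
result = auto_login({"user": "alice"})
```

With the base toolset the agent would need one reasoning step per tool call; with `auto_login` exposed as a single tool, the same two actions cost one call.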

Main Contribution

AWO: a framework that builds a merged state graph from execution traces and extracts repeated tool-call chains as meta-tools.

An algorithm for horizontal and vertical graph merging plus greedy meta-tool extraction tuned by a threshold.

Empirical evaluation on VISUALWEBARENA and APPWORLD showing up to 11.9% fewer LLM calls, up to ~15% token cost savings, and small gains in task success.

Open-source release of AWO to enable reproduction and adoption.
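
The extraction step can be approximated with a simple prefix count over past traces. The sketch below is a simplified stand-in, not the paper's algorithm: it does not reproduce AWO's horizontal and vertical merging rules, and `count_prefixes`, `extract_meta_tools`, `threshold`, and `min_len` are hypothetical names. It assumes each trace is just an ordered list of tool names and that a candidate qualifies when it appears at the start of at least a `threshold` fraction of traces.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Trace = List[str]  # assumption: one trace = ordered tool names from a past run


def count_prefixes(traces: List[Trace]) -> Dict[Tuple[str, ...], int]:
    """Count how many traces start with each tool-call prefix."""
    counts: Dict[Tuple[str, ...], int] = defaultdict(int)
    for trace in traces:
        for end in range(1, len(trace) + 1):
            counts[tuple(trace[:end])] += 1
    return dict(counts)


def extract_meta_tools(
    traces: List[Trace], threshold: float, min_len: int = 2
) -> List[Tuple[str, ...]]:
    """Greedily keep the longest prefixes seen in >= threshold of traces."""
    if not traces:
        return []
    counts = count_prefixes(traces)
    frequent = [
        prefix
        for prefix, count in counts.items()
        if len(prefix) >= min_len and count / len(traces) >= threshold
    ]
    frequent.sort(key=len, reverse=True)  # longest chains first
    selected: List[Tuple[str, ...]] = []
    for prefix in frequent:
        # Skip a prefix already covered by a longer selected chain.
        if not any(chain[: len(prefix)] == prefix for chain in selected):
            selected.append(prefix)
    return selected
```

Raising `threshold` yields shorter but more widely applicable chains; lowering it yields longer chains that fewer tasks can use.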

Key Findings

AWO reduces the number of LLM calls on evaluated benchmarks.

Numbers: LLM calls reduced up to 11.9% (APPWORLD, GPT 5.1)

Total token usage and monetary cost decreased when meta-tools are used.

Numbers: Total tokens −14.9% and cost −15.0% (APPWORLD, GPT 5.1)

Task success often improves after adding meta-tools.

Numbers: Task success up to +4.2 percentage points reported (varies by split and model)

Meta-tool applicability depends on workload structure.

Numbers: Meta-tool utilisation: 98.2% (APPWORLD) vs 16–31% (VISUALWEBARENA)

Horizontal merging requires domain knowledge and careful validation.
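
One way to picture why merging equivalent states needs domain knowledge is a rule-based normaliser that maps concrete tool calls onto a shared form. This is a sketch under stated assumptions, not AWO's rule format: the two regex rules, the string representation of calls, and the names `RULES`, `normalise`, and `merge_equivalent` are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical hand-written rules (assumptions, not the paper's rule set).
# Each rule rewrites a concrete argument into a placeholder.
RULES = [
    (re.compile(r"'[^']*'"), "<STR>"),  # single-quoted string arguments
    (re.compile(r"\b\d+\b"), "<NUM>"),  # standalone integers
]


def normalise(call: str) -> str:
    """Apply every rule, in order, to one logged tool call."""
    for pattern, replacement in RULES:
        call = pattern.sub(replacement, call)
    return call


def merge_equivalent(calls: List[str]) -> Dict[str, List[str]]:
    """Group logged calls whose normalised form is identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for call in calls:
        groups[normalise(call)].append(call)
    return dict(groups)
```

The same rule that usefully merges `open_item(id=42)` with `open_item(id=7)` would also merge `set_quantity(0)` with `set_quantity(5)`, which may not be equivalent in a given application. That is the kind of case that calls for validation before a merge rule is trusted.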

Results

LLM call reduction

Value: up to 11.9% fewer calls

Baseline: base toolset

Token usage reduction

Value: up to 14.9% fewer tokens

Baseline: base toolset

Monetary cost reduction

Value: up to 15.0% cost decrease

Baseline: base toolset

Task success rate change

Value: improvements up to +4.2 percentage points

Baseline: base toolset

Meta-tool utilisation

Value: 98.2% utilisation

Baseline: no meta-tools

Who Should Care

What To Try In 7 Days

Collect recent agent execution traces and inspect frequent prefixes.

Build a simple state graph and find the top 2–5 repeated prefixes.

Implement 1–3 meta-tools (e.g., auto-login) and run A/B tests measuring LLM calls, tokens, and success rate for a week of traffic.
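
For the A/B comparison, a small helper like the following could summarise the two arms. It is a sketch only: `RunStats` and `compare` are hypothetical names, and it assumes you log LLM calls, tokens, and a success flag per task. Relative changes are reported in percent and the success change in percentage points, matching how the results here are stated.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunStats:
    """Per-task measurements (assumed to come from your own logging)."""
    llm_calls: int
    tokens: int
    success: bool


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def compare(baseline: List[RunStats], treatment: List[RunStats]) -> Dict[str, float]:
    """Summarise treatment (with meta-tools) against baseline (base toolset)."""
    def relative_change(attr: str) -> float:
        base = _mean([getattr(run, attr) for run in baseline])
        treat = _mean([getattr(run, attr) for run in treatment])
        return 100.0 * (treat - base) / base  # negative means a reduction

    success_base = 100.0 * _mean([run.success for run in baseline])
    success_treat = 100.0 * _mean([run.success for run in treatment])
    return {
        "llm_calls_pct": relative_change("llm_calls"),
        "tokens_pct": relative_change("tokens"),
        "success_pp": success_treat - success_base,  # percentage points
    }
```

Both arms must be non-empty, and a week of traffic is only meaningful if tasks are assigned to the arms at random.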

Agent Features

Memory

  • execution traces
  • short-term state graph

Planning

  • iterative reasoning
  • stepwise action planning

Tool Use

  • function calling
  • meta-tools (composite tools)
  • tool merging

Frameworks

  • AWO
  • ReAct

Is Agentic

true

Architectures

  • ReAct loop
  • tool-calling agent

Optimization Features

Token Efficiency

  • remove entire output-token-heavy steps by bundling
  • fewer new generations per task

Infra Optimization

  • lower LLM billing and end-to-end latency

System Optimization

  • state-graph merging and rule-based compression
  • deterministic composite tool execution

Inference Optimization

  • reduce total LLM invocations via meta-tools
  • shorter trajectories reduce chance of hallucination

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Horizontal merging needs domain expertise; unsafe merges can alter semantics.
  • Not all merged candidates are convertible to meta-tools due to input complexity.
  • Automated rule discovery is preliminary and can produce noisy or harmful rules.

When Not To Use

  • Workloads with highly dynamic, one-off tool sequences and no repeated prefixes.
  • Tools with complex or unsafe side effects that cannot be composed deterministically.
  • When you lack execution traces representative of future traffic.

Failure Modes

  • Applying a meta-tool that omits necessary intermediate checks, causing incorrect side effects.
  • Automated regex/rule generation introducing bad merges that reduce accuracy.
  • LLM randomness causing occasional drops in success despite meta-tools (observed for some settings).
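
One possible mitigation for the first failure mode is to keep the intermediate checks inside the meta-tool and abort when one fails, so the agent can fall back to step-by-step tool calls. The sketch below is an assumption about how that might look, not something described in the source; `run_guarded` and `MetaToolAborted` are hypothetical names.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Context = Dict[str, Any]
Step = Callable[[Context], Context]
Check = Callable[[Context], bool]


class MetaToolAborted(Exception):
    """Raised when an intermediate check fails; the caller should fall back."""


def run_guarded(steps: List[Tuple[Step, Optional[Check]]], context: Context) -> Context:
    """Run steps in order, validating the context after each guarded step."""
    for index, (step, check) in enumerate(steps):
        context = step(context)
        if check is not None and not check(context):
            raise MetaToolAborted(f"check failed after step {index}")
    return context
```

Aborting does not undo side effects of steps that already ran, so this pattern fits best when the early steps are idempotent or read-only.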

Core Entities

Models

  • GPT 5.1
  • Claude Sonnet 4.5
  • GPT-OSS 120B

Metrics

  • LLM call count
  • Token usage
  • Monetary cost
  • Task success rate
  • Meta-tool utilisation
  • Execution steps

Datasets

  • VISUALWEBARENA
  • APPWORLD

Benchmarks

  • VISUALWEBARENA
  • APPWORLD