AnyTool: GPT-4 agent that searches 16k+ APIs via hierarchical retrieval and self-reflection

February 6, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

1

Authors

Yu Du, Fangyun Wei, Hongyang Zhang

Why It Matters For Business

AnyTool automates picking and calling hundreds of real APIs without extra model training, so teams can prototype API-heavy automation faster; expect higher success rates but plan for high token and API-call costs.

Summary TLDR

AnyTool is a GPT-4-based agent that searches a pool of 16,000+ APIs without extra training. It combines a three-tier hierarchical API retriever (meta, category, and tool agents), a solver (DFSDT or CoT), and a self-reflection loop that re-activates the retriever and solver when a solution fails. The authors revise ToolBench's evaluation by filtering out non-solvable queries and add a new benchmark, AnyToolBench. Reported results: ~58.2% average pass rate on filtered ToolBench and 73.8% on AnyToolBench, ahead of ToolLLM and plain GPT-4 variants. Cost: ~135k tokens and ~43 OpenAI calls per query on average.

Problem Statement

Finding correct APIs among 16K+ real-world endpoints is hard. Prior systems train a retriever and still miss relevant APIs and lack feedback. Existing evaluation (ToolBench) can inflate pass rates by counting non-solvable queries as passes. We need a practical, closed-loop agent that (1) scales search, (2) self-corrects, and (3) is evaluated realistically.
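To make the evaluation concern concrete, here is a minimal sketch of how counting non-solvable queries as passes can inflate a pass rate, compared with filtering them out first. The `QueryResult` type, both function names, and the toy data are illustrative assumptions; the real ToolBench scoring rules are more involved than this.

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    solvable: bool  # could the query be solved at all with the available APIs?
    solved: bool    # did the agent actually produce a working solution?

def inflated_pass_rate(results):
    """Treat every non-solvable query as a pass (the behaviour being criticised)."""
    passes = sum(1 for r in results if r.solved or not r.solvable)
    return passes / len(results)

def filtered_pass_rate(results):
    """Drop non-solvable queries first, then score only the remaining ones."""
    kept = [r for r in results if r.solvable]
    return sum(1 for r in kept if r.solved) / len(kept)

# Made-up toy data: 10 queries, 4 non-solvable, 3 of the 6 solvable ones solved.
results = (
    [QueryResult(solvable=False, solved=False)] * 4
    + [QueryResult(solvable=True, solved=True)] * 3
    + [QueryResult(solvable=True, solved=False)] * 3
)

print(inflated_pass_rate(results))  # 0.7
print(filtered_pass_rate(results))  # 0.5
```

On this toy data the same agent scores 70% under the inflated count and 50% once non-solvable queries are filtered out.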

Main Contribution

A plug-and-play agent that runs on GPT-4 function calling and requires no extra model training.

A three-tier hierarchical API retriever (meta → category → tool agents) that narrows search across 16K+ RapidAPI entries.

A solver using DFSDT (depth-first search decision tree) or Chain-of-Thought to call candidate APIs and return solutions.

A self-reflection loop that re-activates retriever and solver when solutions fail, expanding the candidate pool selectively.

A critique of ToolBench evaluation and the release of AnyToolBench plus a revised pass-rate protocol that filters non-solvable queries.
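The meta → category → tool narrowing can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: in AnyTool each tier is a GPT-4 agent, whereas here a keyword-overlap check (`is_relevant`) stands in for the model's judgment, and the API pool, tool names, and keywords are all made up.

```python
# Toy pool: category -> tools -> APIs, each with hand-written keywords (all hypothetical).
API_POOL = {
    "Weather": {
        "keywords": "weather forecast storm temperature",
        "tools": {
            "OpenSky": {
                "keywords": "weather forecast temperature",
                "apis": {"current_weather": "weather temperature now",
                         "forecast_5day": "forecast"},
            },
            "StormWatch": {
                "keywords": "storm alerts",
                "apis": {"storm_alerts": "storm alerts"},
            },
        },
    },
    "Finance": {
        "keywords": "stock price currency exchange",
        "tools": {
            "StockPeek": {
                "keywords": "stock price quote history",
                "apis": {"stock_quote": "stock price quote"},
            },
            "FxNow": {
                "keywords": "currency exchange convert",
                "apis": {"currency_convert": "currency convert"},
            },
        },
    },
}

def is_relevant(query, keywords):
    """Stand-in for an LLM relevance decision: any shared word counts as relevant."""
    return bool(set(query.lower().split()) & set(keywords.split()))

def tool_agent(query, tool_name, tool):
    """Lowest tier: pick individual APIs inside one tool."""
    return [f"{tool_name}.{api}" for api, kw in tool["apis"].items()
            if is_relevant(query, kw)]

def category_agent(query, category):
    """Middle tier: pick tools inside one category and delegate to tool agents."""
    candidates = []
    for tool_name, tool in category["tools"].items():
        if is_relevant(query, tool["keywords"]):
            candidates += tool_agent(query, tool_name, tool)
    return candidates

def meta_agent(query):
    """Top tier: pick categories and delegate to category agents."""
    candidates = []
    for category in API_POOL.values():
        if is_relevant(query, category["keywords"]):
            candidates += category_agent(query, category)
    return candidates

print(meta_agent("what is the weather forecast in paris"))
# ['OpenSky.current_weather', 'OpenSky.forecast_5day']
```

Each tier only inspects the entries under the branch it was handed, so no single agent has to consider the whole pool.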

Key Findings

AnyTool substantially improves real-task pass rates over prior systems.

Numbers: +35.4% average pass rate vs ToolLLM on ToolBench (as reported)

AnyTool reaches 73.8% pass rate on the authors' AnyToolBench.

Numbers: 73.8% pass rate on AnyToolBench

Self-reflection yields meaningful gains quickly.

Numbers: Up to ~20% pass-rate improvement with 4–6 reflection rounds

Both the hierarchical retriever and self-reflection are critical.

Numbers: G2-I pass rate drops from 58.9% to 22.4% when the hierarchy is removed

Resource use and cost per query are high.

Numbers: Avg 13.5×10^4 (~135k) tokens, 43.3 OpenAI calls, 14.1 API candidates per query

Results

Pass rate (filtered ToolBench, average)

Value: 58.2%

Baseline: ToolLLM variants (avg ~22.9–31.8%)

Pass rate (AnyToolBench)

Value: 73.8%

Baseline: GPT-4 plain-agent 14.0%; ToolLLM 36.6%

Self-reflection benefit

Value: Up to ~20% increase

Baseline: No self-reflection

Resource use per query

Value: Average 135k tokens; 43.3 OpenAI calls; 14.1 API candidates
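For budgeting, the per-query averages translate into spend with simple arithmetic. The sketch below uses the reported averages; the per-token price is a hypothetical placeholder, not an actual OpenAI rate, and should be replaced with whatever your contract or current pricing says.

```python
AVG_TOKENS_PER_QUERY = 135_000  # reported average (13.5×10^4 tokens)
AVG_OPENAI_CALLS = 43.3         # reported average OpenAI calls per query

def estimate_cost(num_queries, usd_per_1k_tokens):
    """Rough spend estimate: total tokens times a flat per-1k-token price."""
    total_tokens = num_queries * AVG_TOKENS_PER_QUERY
    return total_tokens / 1000 * usd_per_1k_tokens

# Placeholder price for illustration only (assumption, not a quoted rate).
HYPOTHETICAL_USD_PER_1K = 0.03

cost = estimate_cost(1_000, HYPOTHETICAL_USD_PER_1K)
tokens_per_call = AVG_TOKENS_PER_QUERY / AVG_OPENAI_CALLS

print(f"~${cost:,.0f} for 1,000 queries")           # ~$4,050 for 1,000 queries
print(f"~{tokens_per_call:,.0f} tokens per call")   # ~3,118 tokens per call
```

A flat price ignores the usual split between input and output token rates, so treat the result as an order-of-magnitude figure.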

Who Should Care

What To Try In 7 Days

Run AnyTool on a small internal API pool to compare solved-rate vs current pipeline.

Add a short self-reflection loop (4–6 cycles) to existing tool-call flows to boost success.

Measure token and API-call costs per query and set limits or batching to control spending.
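One way to trial the last two items together is to wrap an existing tool-call flow in a bounded retry loop that tracks tokens. The sketch below is an assumption-laden illustration, not AnyTool's code: `solve_with_reflection`, the `retrieve` and `solve` callables, and their signatures are hypothetical, and the stubs at the bottom only simulate a retriever and solver.

```python
def solve_with_reflection(query, retrieve, solve, max_rounds=5, token_budget=200_000):
    """Retry loop: on failure, ask the retriever for more candidates and try again.

    retrieve(query, failed) -> list of candidate API names (hypothetical signature)
    solve(query, candidates) -> (answer or None, tokens used)  (hypothetical signature)
    """
    candidates, failed, tokens, round_no = [], [], 0, 0
    for round_no in range(1, max_rounds + 1):
        # Expand the pool only with candidates not seen before.
        candidates += [c for c in retrieve(query, failed) if c not in candidates]
        answer, used = solve(query, candidates)
        tokens += used
        if answer is not None:
            return {"answer": answer, "rounds": round_no, "tokens": tokens}
        if tokens >= token_budget:
            break  # stop spending once the budget is exhausted
        failed = list(candidates)  # feedback for the next retrieval round
    return {"answer": None, "rounds": round_no, "tokens": tokens}

# Stubs that simulate a retriever and solver (made-up API names and token counts).
def fake_retrieve(query, failed):
    pool = ["geo_lookup", "weather_now", "forecast_5day"]
    return [api for api in pool if api not in failed][:1]

def fake_solve(query, candidates):
    answer = "sunny" if "forecast_5day" in candidates else None
    return answer, 30_000

print(solve_with_reflection("forecast for Paris", fake_retrieve, fake_solve))
# {'answer': 'sunny', 'rounds': 3, 'tokens': 90000}
```

Logging the returned `rounds` and `tokens` per query gives the cost data needed to choose sensible limits.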

Agent Features

Memory

  • per-agent historical context (short-term)
  • global API-candidate pool

Planning

  • DFSDT decision-tree backtracking
  • Chain-of-Thought (CoT) option
  • self-reflection-driven re-planning

Tool Use

  • GPT-4 function calling
  • API selection and function invocation

Frameworks

  • GPT-4 function calling
  • AutoGen-RAG (as a baseline comparator)

Is Agentic

true

Architectures

  • hierarchical meta→category→tool agents
  • single solver agent

Collaboration

  • multi-threaded parallel agents
  • divide-and-conquer category assignment

Optimization Features

Token Efficiency

  • token limit enforced (200k tokens) to cap costs

System Optimization

  • multi-threaded agents for parallel retrieval
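Parallel retrieval across categories could look something like the following, using Python's standard `concurrent.futures.ThreadPoolExecutor`. The category table, `search_category`, and `retrieve_parallel` are hypothetical stand-ins; in a real setup each worker would be an LLM-backed agent making network calls, which is where threads actually save wall-clock time.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up categories and API names for illustration.
CATEGORIES = {
    "Weather": ["weather_now", "forecast_5day"],
    "Finance": ["stock_quote"],
    "Sports": ["match_scores"],
}

def search_category(query, category):
    """Stand-in for one category agent: keep APIs sharing a word with the query."""
    words = set(query.lower().split())
    return [api for api in CATEGORIES[category] if set(api.split("_")) & words]

def retrieve_parallel(query, max_workers=4):
    """Run one search per category concurrently and merge the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_category = pool.map(lambda c: search_category(query, c), CATEGORIES)
        return [api for apis in per_category for api in apis]

print(retrieve_parallel("weather forecast today"))
# ['weather_now', 'forecast_5day']
```

`pool.map` yields results in the order of the inputs, so the merged candidate list stays deterministic even though the searches run concurrently.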

Training Optimization

  • no additional model training required

Inference Optimization

  • hierarchical narrowing to reduce per-agent search scope

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High token consumption and many OpenAI API calls per query (avg ~135k tokens, ~43 calls).
  • Relies on closed-source GPT-4 function-calling; local or offline deployment is not supported.
  • Benchmarks were filtered to remove non-solvable queries; real-world success may vary on noisier inputs.
  • Not validated on very complex scenarios beyond provided benchmarks.

When Not To Use

  • When low-latency, low-cost, or on-device operation is required.
  • If you cannot use GPT-4 function calling due to policy or cost limits.
  • For trivially small API pools where simpler retrieval is cheaper.

Failure Modes

  • Retriever selects irrelevant APIs and GPT-4 labels them non-solvable, leading to wasted calls.
  • Solver returns 'Give Up' when required parameters or API behaviors are missing.
  • Server instability and variable GPT-4 response times can stall pipelines.
  • Potential judge bias if GPT-4 evaluation diverges from real user requirements.

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • ToolLLaMA
  • ToolLLM
  • AnyTool

Metrics

  • pass rate

Datasets

  • ToolBench (filtered)
  • AnyToolBench

Benchmarks

  • ToolBench
  • AnyToolBench

Context Entities

Models

  • text-embedding-ada-002
  • all-mpnet-base-v2

Metrics

  • token consumption
  • OpenAI call count