AnyTool: GPT-4 agent that searches 16k+ APIs via hierarchical retrieval and self-reflection

February 6, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is well-documented and outperforms baselines on curated benchmarks, but it relies on GPT-4 function calling and shows high token/API-call costs that limit immediate low-cost deployment.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yu Du, Fangyun Wei, Hongyang Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

AnyTool automates selecting and calling real APIs from a pool of 16,000+ endpoints without extra model training, so teams can prototype API-heavy automation faster. Expect higher success rates, but plan for high token and API-call costs.

Who Should Care

Summary TLDR

AnyTool is a GPT-4-based agent that searches a 16,000+ API pool without extra training. It uses a three-tier hierarchical API retriever (meta/category/tool agents), a solver (DFSDT or CoT), and a self-reflection loop that re-activates retriever+solver when solutions fail. The authors revise ToolBench's evaluation (filtering non-solvable queries) and add AnyToolBench. AnyTool achieves major gains: ~58.2% avg pass rate on filtered ToolBench and 73.8% on AnyToolBench, outperforming ToolLLM and plain GPT-4 variants. Costs: ~135k tokens and ~43 OpenAI calls per query on average.
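The retrieve, solve, and reflect cycle summarized above can be pictured as a plain control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `retrieve`, `solve`, `max_rounds`, and the toy stand-ins are hypothetical names, and in the real system both steps would be GPT-4 function-calling agents rather than local functions.

```python
from typing import Callable, List, Optional, Set


def closed_loop(
    query: str,
    retrieve: Callable[[str, Set[str]], List[str]],
    solve: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = 4,
) -> Optional[str]:
    """Re-run retrieval and solving until an answer appears or rounds run out."""
    candidates: List[str] = []  # API-candidate pool, kept across rounds
    tried: Set[str] = set()     # APIs already proposed, so retrieval can move on
    for _ in range(max_rounds):
        new_apis = retrieve(query, tried)
        tried.update(new_apis)
        candidates.extend(new_apis)
        answer = solve(query, candidates)
        if answer is not None:  # solver succeeded, stop reflecting
            return answer
    return None                 # hypothetical "give up" outcome


# Toy stand-ins (assumptions for illustration only).
def toy_retrieve(query: str, tried: Set[str]) -> List[str]:
    for api in ["weather.search", "weather.forecast"]:
        if api not in tried:
            return [api]
    return []


def toy_solve(query: str, candidates: List[str]) -> Optional[str]:
    return "sunny" if "weather.forecast" in candidates else None


result = closed_loop("forecast for Paris", toy_retrieve, toy_solve)
```

In this toy run the first round fails because only `weather.search` is available, and the second round succeeds once the retriever adds `weather.forecast`. That is the behavior the self-reflection loop is meant to provide.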

Problem Statement

Finding correct APIs among 16K+ real-world endpoints is hard. Prior systems train a retriever and still miss relevant APIs and lack feedback. Existing evaluation (ToolBench) can inflate pass rates by counting non-solvable queries as passes. We need a practical, closed-loop agent that (1) scales search, (2) self-corrects, and (3) is evaluated realistically.
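To see why counting non-solvable queries as passes inflates the metric, consider the small worked example below. It is a sketch with made-up data: the `Outcome` record and both scoring functions are hypothetical simplifications, not the exact ToolBench or AnyTool evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Outcome:
    solvable: bool  # can the query be solved with the available APIs?
    solved: bool    # did the agent produce a correct solution?


def inflated_pass_rate(outcomes: List[Outcome]) -> float:
    """Counts every non-solvable query as a pass (the criticized behavior)."""
    passes = sum(1 for o in outcomes if o.solved or not o.solvable)
    return passes / len(outcomes)


def filtered_pass_rate(outcomes: List[Outcome]) -> float:
    """Drops non-solvable queries first, then scores only real solutions."""
    solvable = [o for o in outcomes if o.solvable]
    return sum(1 for o in solvable if o.solved) / len(solvable)


# Made-up data: 4 non-solvable queries, 3 solved, 3 solvable but unsolved.
outcomes = (
    [Outcome(solvable=False, solved=False)] * 4
    + [Outcome(solvable=True, solved=True)] * 3
    + [Outcome(solvable=True, solved=False)] * 3
)

inflated = inflated_pass_rate(outcomes)  # 7 of 10 queries count as passes
filtered = filtered_pass_rate(outcomes)  # 3 of 6 solvable queries are solved
```

With these invented numbers the same agent scores 0.7 under the inflated rule and 0.5 after filtering, which is why the authors report results on a filtered ToolBench.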

Main Contribution

A plug-and-play agent that runs on GPT-4 function calling and requires no extra model training.

A three-tier hierarchical API retriever (meta → category → tool agents) that narrows search across 16K+ RapidAPI entries.
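The meta → category → tool narrowing can be sketched as three nested filters over a category/tool/API hierarchy. This is an illustrative sketch only: the `pool` contents, the `keyword_match` relevance test, and the function names are assumptions. In AnyTool each tier is a GPT-4 agent making the relevance decision, not a keyword check.

```python
from typing import Callable, Dict, List

ApiPool = Dict[str, Dict[str, List[str]]]  # category -> tool -> API names


def hierarchical_retrieve(
    query: str, pool: ApiPool, relevant: Callable[[str, str], bool]
) -> List[str]:
    """Narrow the search tier by tier instead of scanning every API."""
    candidates: List[str] = []
    for category, tools in pool.items():      # meta tier: choose categories
        if not relevant(query, category):
            continue
        for tool, apis in tools.items():      # category tier: choose tools
            if not relevant(query, tool):
                continue
            for api in apis:                  # tool tier: choose APIs
                if relevant(query, api):
                    candidates.append(f"{category}/{tool}/{api}")
    return candidates


def keyword_match(query: str, name: str) -> bool:
    """Hypothetical stand-in for an LLM relevance judgment."""
    return any(word in query.lower() for word in name.lower().split("_"))


pool: ApiPool = {
    "weather": {
        "weather_forecast": ["daily_forecast", "hourly_forecast"],
        "weather_alerts": ["storm_alerts"],
    },
    "finance": {"stock_quotes": ["latest_quote"]},
}

found = hierarchical_retrieve("daily weather forecast for Paris", pool, keyword_match)
```

Because the `finance` category is rejected at the top tier, none of its tools or APIs are ever inspected. This pruning is what makes a search over 16K+ entries tractable.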

Key Findings

AnyTool substantially improves real-task pass rates over prior systems.

Numbers: +35.4% average pass rate vs ToolLLM on ToolBench (as reported)

Practical Use: Expect much higher solved-rate when replacing a trained retriever + solver with AnyTool in similar API-heavy tasks.

Evidence Ref: Abstract; Table 1

AnyTool reaches 73.8% pass rate on the authors' AnyToolBench.

Numbers: 73.8% pass rate on AnyToolBench

Practical Use: For end-to-end tasks built from many real APIs, AnyTool delivers a usable majority of correct solutions on the authors' benchmark.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass rate (filtered ToolBench, average) | 58.2% | ToolLLM variants (avg ~22.9–31.8%) | ~+32.6 points vs best ToolLLM variant reported | Filtered ToolBench (six subsets) | Table 1 shows AnyTool avg 58.2% vs ToolLLM variants | Table 1
Pass rate (AnyToolBench) | 73.8% | GPT-4 plain-agent 14.0%; ToolLLM 36.6% | +37.2 points vs ToolLLM (36.6%) | AnyToolBench (400 instances) | Table 2 reports 73.8% for AnyTool | Table 2
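The AnyToolBench deltas are differences in percentage points, not relative improvements. The short calculation below reproduces them from the pass rates quoted in the results. The variable names are illustrative.

```python
# Pass rates on AnyToolBench, in percent, as quoted in the results above.
anytool = 73.8
baselines = {"ToolLLM": 36.6, "GPT-4 plain agent": 14.0}

# Percentage-point gap between AnyTool and each baseline.
deltas = {name: round(anytool - score, 1) for name, score in baselines.items()}

for name, delta in deltas.items():
    print(f"AnyTool vs {name}: +{delta} points")
```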

What To Try In 7 Days

Run AnyTool on a small internal API pool to compare solved-rate vs current pipeline.

Add a short self-reflection loop (4–6 cycles) to existing tool-call flows to boost success.

Measure token and API-call costs per query and set limits or batching to control spending.
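One way to measure and cap spending per query is a small budget object that every model call reports to. This is a hedged sketch: `QueryBudget`, `BudgetExceeded`, and the `max_calls` default are hypothetical. Only the 200k-token cap mirrors a limit described for AnyTool, and real token counts would come from the API response's usage data.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a query goes over its token or call allowance."""


@dataclass
class QueryBudget:
    max_tokens: int = 200_000  # per-query token cap
    max_calls: int = 50        # hypothetical call cap, tune for your workload
    tokens_used: int = 0
    calls_made: int = 0

    def record(self, tokens: int) -> None:
        """Log one model call and stop the query once a limit is crossed."""
        self.tokens_used += tokens
        self.calls_made += 1
        if self.tokens_used > self.max_tokens or self.calls_made > self.max_calls:
            raise BudgetExceeded(
                f"{self.tokens_used} tokens / {self.calls_made} calls used"
            )


# Usage with deliberately small limits so the cap is easy to trigger.
budget = QueryBudget(max_tokens=10_000, max_calls=3)
budget.record(4_000)
budget.record(4_000)
try:
    budget.record(4_000)  # third call pushes usage past 10,000 tokens
    stopped = False
except BudgetExceeded:
    stopped = True
```

Logging `tokens_used` and `calls_made` at the end of each query also gives the per-query cost figures needed to compare against the current pipeline.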

Agent Features

Memory
per-agent historical context (short-term); global API-candidate pool
Planning
DFSDT decision-tree backtracking; Chain-of-Thought (CoT) option; self-reflection-driven re-planning
Tool Use
GPT-4 function calling; API selection and function invocation
Frameworks
GPT-4 function calling; AutoGen-RAG (as a baseline comparator)
Is Agentic

Yes

Architectures
hierarchical meta→category→tool agents; single solver agent
Collaboration
multi-threaded parallel agents; divide-and-conquer category assignment
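The divide-and-conquer collaboration pattern can be approximated with a thread pool that gives each category to its own worker and merges the results. This is a sketch under assumptions: `search_one`, the category data, and `max_workers` are hypothetical, and each worker would be an LLM-backed agent in practice rather than a string filter.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def search_categories_in_parallel(
    query: str,
    categories: Dict[str, List[str]],
    search_one: Callable[[str, List[str]], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Assign one category per worker thread and merge their candidate APIs."""
    names = list(categories)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `names`.
        per_category = executor.map(
            lambda name: search_one(query, categories[name]), names
        )
        merged: List[str] = []
        for hits in per_category:
            merged.extend(hits)
    return merged


def toy_search(query: str, apis: List[str]) -> List[str]:
    """Hypothetical category agent: keep APIs that mention a query word."""
    return [api for api in apis if any(word in api for word in query.split())]


categories = {
    "weather": ["weather_daily", "weather_alerts"],
    "finance": ["stock_quote"],
    "travel": ["flight_search"],
}

merged = search_categories_in_parallel("weather flight", categories, toy_search)
```

Threads suit this workload because each agent spends most of its time waiting on network calls, so parallelism cuts wall-clock latency even though total token usage is unchanged.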

Optimization Features

Token Efficiency
token limit enforced (200k tokens) to cap costs
System Optimization
multi-threaded agents for parallel retrieval
Training Optimization
no additional model training required
Inference Optimization
hierarchical narrowing to reduce per-agent search scope

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

High token consumption and many OpenAI API calls per query (avg ~135k tokens, ~43 calls).

Relies on closed-source GPT-4 function-calling; local or offline deployment is not supported.
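The per-query averages above translate directly into a budget estimate. The sketch below is a back-of-the-envelope calculation: the function name, the query volume, and the price per 1,000 tokens are placeholders to replace with your own figures. Only the 135k-token and 43-call averages come from the reported numbers.

```python
def projected_usage(
    queries: int,
    usd_per_1k_tokens: float,
    tokens_per_query: int = 135_000,  # reported average
    calls_per_query: int = 43,        # reported average
) -> dict:
    """Estimate total tokens, model calls, and spend for a batch of queries."""
    total_tokens = queries * tokens_per_query
    return {
        "tokens": total_tokens,
        "calls": queries * calls_per_query,
        "usd": total_tokens / 1_000 * usd_per_1k_tokens,
    }


# Placeholder inputs: 1,000 queries at a made-up blended price of $0.01 per 1k tokens.
estimate = projected_usage(queries=1_000, usd_per_1k_tokens=0.01)
```

Because input and output tokens are usually priced differently, a single blended rate is a simplification, so treat the result as an order-of-magnitude guide.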

When Not To Use

When low-latency, low-cost, or on-device operation is required.

If you cannot call GPT-4 function-calling due to policy or cost limits.

Failure Modes

Retriever selects irrelevant APIs and GPT-4 labels them non-solvable, leading to wasted calls.

Solver returns 'Give Up' when required parameters or API behaviors are missing.

Core Entities

Models

GPT-4, GPT-3.5, ToolLLaMA, ToolLLM, AnyTool

Metrics

pass rate

Datasets

ToolBench (filtered), AnyToolBench

Benchmarks

ToolBench, AnyToolBench

Context Entities

Models

text-embedding-ada-002, all-mpnet-base-v2

Metrics

token consumption, OpenAI call count