AnyTool: GPT-4 agent that searches 16k+ APIs via hierarchical retrieval and self-reflection

Overview

Decision Snapshot: Needs Validation

The method is well-documented and outperforms baselines on curated benchmarks, but it relies on GPT-4 function calling and shows high token/API-call costs that limit immediate low-cost deployment.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yu Du, Fangyun Wei, Hongyang Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

AnyTool automates picking and calling hundreds of real APIs without extra model training, so teams can prototype API-heavy automation faster; expect higher success rates but plan for high token and API-call costs.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

AnyTool is a GPT-4-based agent that searches a 16,000+ API pool without extra training. It uses a three-tier hierarchical API retriever (meta/category/tool agents), a solver (DFSDT or CoT), and a self-reflection loop that re-activates retriever+solver when solutions fail. The authors revise ToolBench's evaluation (filtering non-solvable queries) and add AnyToolBench. AnyTool achieves major gains: ~58.2% avg pass rate on filtered ToolBench and 73.8% on AnyToolBench, outperforming ToolLLM and plain GPT-4 variants. Costs: ~135k tokens and ~43 OpenAI calls per query on average.
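
The closed loop described above (retrieve, solve, and re-activate both on failure) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `solve_with_reflection`, `Attempt`, and the two stub functions are hypothetical names, the stubs stand in for GPT-4 agents, and the default of four rounds is only an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    solved: bool
    answer: Optional[str] = None
    reason: str = ""  # why the solver failed; fed back to the retriever


def solve_with_reflection(
    query: str,
    retrieve: Callable[[str, List[str], str], List[str]],
    solve: Callable[[str, List[str]], Attempt],
    max_rounds: int = 4,  # assumption: a small fixed cap on reflection rounds
) -> Attempt:
    """Alternate retrieval and solving until solved or rounds run out."""
    candidates: List[str] = []
    feedback = ""
    attempt = Attempt(solved=False, reason="not attempted")
    for _ in range(max_rounds):
        # Re-activate the retriever, passing the last failure reason
        # so it can expand the candidate pool.
        for api in retrieve(query, candidates, feedback):
            if api not in candidates:
                candidates.append(api)
        attempt = solve(query, candidates)
        if attempt.solved:
            return attempt
        feedback = attempt.reason
    return attempt


# Hypothetical stubs standing in for the GPT-4 retriever and solver.
def fake_retrieve(query: str, pool: List[str], feedback: str) -> List[str]:
    return ["weather.current"] if not feedback else ["geo.lookup"]


def fake_solve(query: str, pool: List[str]) -> Attempt:
    if "geo.lookup" in pool:
        return Attempt(True, answer="ok")
    return Attempt(False, reason="missing geocoding API")


result = solve_with_reflection("weather in Paris", fake_retrieve, fake_solve)
```

In this toy run the first round fails, the failure reason is passed back to the retriever, and the second round succeeds once the missing API joins the pool.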

Problem Statement

Finding the correct APIs among 16K+ real-world endpoints is hard. Prior systems train a dedicated retriever, yet they still miss relevant APIs and have no feedback loop to recover from a bad retrieval. The existing evaluation (ToolBench) can inflate pass rates by counting non-solvable queries as passes. The paper therefore argues for a practical, closed-loop agent that (1) scales search, (2) self-corrects, and (3) is evaluated realistically.
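
To see why counting non-solvable queries as passes inflates the metric, consider the simplified calculation below. It is a sketch under stated assumptions: `pass_rate` is a hypothetical helper, the data is a made-up toy set, and the real ToolBench protocol has more detail than this.

```python
from typing import Iterable, List, Tuple


def pass_rate(results: Iterable[Tuple[bool, bool]], filter_unsolvable: bool) -> float:
    """Pass rate in percent over (is_solvable, agent_solved) pairs."""
    rows: List[Tuple[bool, bool]] = list(results)
    if filter_unsolvable:
        # Revised protocol: drop non-solvable queries before scoring.
        kept = [(solvable, ok) for solvable, ok in rows if solvable]
        passed = sum(ok for _, ok in kept)
    else:
        # Inflated protocol: a non-solvable query counts as a pass.
        kept = rows
        passed = sum((not solvable) or ok for solvable, ok in kept)
    return 100.0 * passed / len(kept) if kept else 0.0


# Toy data (invented for illustration): 4 non-solvable queries,
# 3 solvable and solved, 3 solvable and failed.
toy = [(False, False)] * 4 + [(True, True)] * 3 + [(True, False)] * 3

inflated = pass_rate(toy, filter_unsolvable=False)
filtered = pass_rate(toy, filter_unsolvable=True)
```

On this toy set the unfiltered score counts 7 of 10 queries as passes, while the filtered score counts 3 of 6, so the same agent looks noticeably weaker once non-solvable queries are removed.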

Main Contribution

A plug-and-play agent that runs on GPT-4 function calling and requires no extra model training.

A three-tier hierarchical API retriever (meta → category → tool agents) that narrows search across 16K+ RapidAPI entries.
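
The meta → category → tool narrowing can be illustrated with a small sketch. Everything here is an assumption for illustration: the catalog, the tool and API names, and `keyword_judge` are hypothetical, and a keyword match stands in for the GPT-4 agents that make the relevance decision at each tier.

```python
from typing import Callable, Dict, List

# Hypothetical toy catalog: category -> tool -> API names.
CATALOG: Dict[str, Dict[str, List[str]]] = {
    "Weather": {"OpenSky": ["current_weather", "forecast_5day"]},
    "Finance": {"StockBox": ["get_quote", "get_history"]},
    "Travel": {"FlyFinder": ["search_flights"], "StayHub": ["search_hotels"]},
}

# Hypothetical trigger words; a stand-in for an LLM relevance judgment.
HINTS: Dict[str, set] = {
    "Weather": {"weather", "forecast"},
    "OpenSky": {"weather", "forecast"},
    "current_weather": {"weather"},
    "forecast_5day": {"forecast"},
    "Finance": {"stock", "quote"},
    "StockBox": {"stock", "quote"},
    "get_quote": {"quote"},
    "get_history": {"stock"},
    "Travel": {"flight", "hotel"},
    "FlyFinder": {"flight"},
    "search_flights": {"flight"},
    "StayHub": {"hotel"},
    "search_hotels": {"hotel"},
}


def keyword_judge(name: str, query: str) -> bool:
    """Return True if any trigger word for `name` appears in the query."""
    q = query.lower()
    return any(hint in q for hint in HINTS.get(name, ()))


def hierarchical_retrieve(
    query: str,
    catalog: Dict[str, Dict[str, List[str]]],
    judge: Callable[[str, str], bool],
) -> List[str]:
    """Narrow the search tier by tier and collect candidate APIs."""
    candidates: List[str] = []
    for category, tools in catalog.items():  # meta tier picks categories
        if not judge(category, query):
            continue
        for tool, apis in tools.items():  # category tier picks tools
            if not judge(tool, query):
                continue
            for api in apis:  # tool tier picks individual APIs
                if judge(api, query):
                    candidates.append(f"{category}/{tool}/{api}")
    return candidates


found = hierarchical_retrieve(
    "Find a flight to Paris and the weather forecast there",
    CATALOG,
    keyword_judge,
)
```

The point of the structure is that each tier only inspects the children of entries its parent accepted, so the Finance branch is never expanded for this query.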

Key Findings

AnyTool substantially improves real-task pass rates over prior systems.

Numbers: +35.4% average pass rate vs ToolLLM on ToolBench (as reported)

Practical Use: Expect a much higher solved rate when replacing a trained retriever + solver with AnyTool in similar API-heavy tasks.

Evidence Ref: Abstract; Table 1

AnyTool reaches 73.8% pass rate on the authors' AnyToolBench.

Numbers: 73.8% pass rate on AnyToolBench

Practical Use: For end-to-end tasks built from many real APIs, AnyTool delivers a usable majority of correct solutions on the authors' benchmark.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass rate (filtered ToolBench, average)	58.2%	ToolLLM variants (avg ~22.9–31.8%)	+~32.6 points vs best ToolLLM variant reported	Filtered ToolBench (six subsets)	Table 1 shows AnyTool avg 58.2% vs ToolLLM variants	Table 1
Pass rate (AnyToolBench)	73.8%	GPT-4 plain-agent 14.0%; ToolLLM 36.6%	+37.2 points vs ToolLLM (36.6%)	AnyToolBench (400 instances)	Table 2 reports 73.8% for AnyTool	Table 2
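
As a quick check of the AnyToolBench row, the percentage-point deltas follow directly from the reported pass rates. The snippet uses only the numbers in the table above; the ToolBench row is left out because its baseline is given as a range rather than a single value.

```python
# Pass rates (percent) as reported in the AnyToolBench row above.
anytool = 73.8
baselines = {"GPT-4 plain agent": 14.0, "ToolLLM": 36.6}

# Percentage-point gap between AnyTool and each baseline.
deltas = {name: round(anytool - score, 1) for name, score in baselines.items()}
```

The ToolLLM entry reproduces the +37.2 points stated in the table, and the same subtraction gives the gap to the plain GPT-4 agent.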

What To Try In 7 Days

Run AnyTool on a small internal API pool to compare solved-rate vs current pipeline.

Add a short self-reflection loop (4–6 cycles) to existing tool-call flows to boost success.

Measure token and API-call costs per query and set limits or batching to control spending.
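
For the cost-measurement step, one possible approach is a small per-query meter that counts tokens and calls and stops the run once a cap is hit. This is a sketch, not part of AnyTool: `UsageMeter` and `BudgetExceeded` are hypothetical names, the 200k token default mirrors the limit listed under Token Efficiency, and the call cap is an arbitrary placeholder.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a query goes over its token or call budget."""


@dataclass
class UsageMeter:
    max_tokens: int = 200_000  # token limit mentioned in this summary
    max_calls: int = 60        # hypothetical cap; tune to your own budget
    tokens: int = 0
    calls: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one LLM call to the running totals and enforce the caps."""
        self.calls += 1
        self.tokens += prompt_tokens + completion_tokens
        if self.tokens > self.max_tokens or self.calls > self.max_calls:
            raise BudgetExceeded(f"{self.tokens} tokens / {self.calls} calls")


# Toy usage with a deliberately small budget.
meter = UsageMeter(max_tokens=10_000, max_calls=3)
meter.record(3_000, 500)
meter.record(3_000, 500)
try:
    meter.record(3_000, 500)  # third call pushes the total past 10,000 tokens
    stopped = False
except BudgetExceeded:
    stopped = True
```

Calling `record` after every model response gives the per-query totals needed for comparison, and the exception gives the calling loop a single place to decide whether to stop or fall back.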

Agent Features

Memory

per-agent historical context (short-term)

global API-candidate pool

Planning

DFSDT decision-tree backtracking

Chain-of-Thought (CoT) option

self-reflection-driven re-planning

Tool Use

GPT-4 function calling

API selection and function invocation

Frameworks

GPT-4 function calling

AutoGen-RAG (as a baseline comparator)

Is Agentic

Yes

Architectures

hierarchical meta→category→tool agents

single solver agent

Collaboration

multi-threaded parallel agents

divide-and-conquer category assignment

Optimization Features

Token Efficiency

token limit enforced (200k tokens) to cap costs

System Optimization

multi-threaded agents for parallel retrieval
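
Parallel retrieval across categories could look roughly like the sketch below, which uses Python's standard `ThreadPoolExecutor`. The function `retrieve_in_parallel`, the stub agent, and the toy data are hypothetical; in a real system each worker would wrap an LLM call rather than a dictionary lookup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def retrieve_in_parallel(
    query: str,
    categories: List[str],
    category_agent: Callable[[str, str], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Run one agent per category concurrently and merge their candidates."""
    merged: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the merge is deterministic.
        for found in pool.map(lambda c: category_agent(c, query), categories):
            for api in found:
                if api not in merged:  # keep the shared pool free of duplicates
                    merged.append(api)
    return merged


# Hypothetical stub: each "agent" just looks up a fixed answer.
TOY_ANSWERS: Dict[str, List[str]] = {
    "Weather": ["current_weather"],
    "Travel": ["search_flights", "search_hotels"],
    "Finance": [],
}


def stub_agent(category: str, query: str) -> List[str]:
    return TOY_ANSWERS[category]


pool_of_apis = retrieve_in_parallel(
    "plan a trip", ["Weather", "Travel", "Finance"], stub_agent
)
```

Threads suit this workload because the agents spend most of their time waiting on network responses, so the interpreter lock is not the bottleneck.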

Training Optimization

no additional model training required

Inference Optimization

hierarchical narrowing to reduce per-agent search scope

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/dyabel/AnyTool

Risks & Boundaries

Limitations

High token consumption and many OpenAI API calls per query (avg ~135k tokens, ~43 calls).

Relies on closed-source GPT-4 function-calling; local or offline deployment is not supported.
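
The average figures in the first limitation (~135k tokens and ~43 calls per query) support a rough back-of-envelope estimate. The price used below is a hypothetical placeholder, not a quoted rate; substitute your provider's current pricing. `cost_per_query` is an illustrative helper, not part of AnyTool.

```python
# Averages reported in this summary.
AVG_TOKENS_PER_QUERY = 135_000
AVG_CALLS_PER_QUERY = 43

# Average tokens handled per model call.
tokens_per_call = AVG_TOKENS_PER_QUERY / AVG_CALLS_PER_QUERY


def cost_per_query(price_per_1k_tokens: float) -> float:
    """Estimated spend per query for a flat per-1k-token price."""
    return AVG_TOKENS_PER_QUERY / 1_000 * price_per_1k_tokens


# Hypothetical price, for illustration only.
EXAMPLE_PRICE = 0.03
per_query = cost_per_query(EXAMPLE_PRICE)
per_thousand_queries = 1_000 * per_query
```

A flat price ignores the usual split between input and output token rates, so treat the result as an order-of-magnitude figure rather than a bill.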

When Not To Use

When low-latency, low-cost, or on-device operation is required.

If you cannot call GPT-4 function-calling due to policy or cost limits.

Failure Modes

The retriever selects irrelevant APIs, GPT-4 then labels the query non-solvable, and the calls already spent are wasted.

The solver chooses 'Give Up' when required parameters or API behaviors are missing.

Core Entities

Models

GPT-4

GPT-3.5

ToolLLaMA

ToolLLM

AnyTool

Metrics

pass rate

Datasets

ToolBench (filtered)

AnyToolBench

Benchmarks

ToolBench

AnyToolBench

Context Entities

Models

text-embedding-ada-002

all-mpnet-base-v2

Metrics

token consumption

OpenAI call count
