Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
AnyTool automates picking and calling hundreds of real APIs without extra model training, so teams can prototype API-heavy automation faster; expect higher success rates but plan for high token and API-call costs.
Summary TLDR
AnyTool is a GPT-4-based agent that searches a 16,000+ API pool without extra training. It uses a three-tier hierarchical API retriever (meta/category/tool agents), a solver (DFSDT or CoT), and a self-reflection loop that re-activates retriever+solver when solutions fail. The authors revise ToolBench's evaluation (filtering non-solvable queries) and add AnyToolBench. AnyTool achieves major gains: ~58.2% avg pass rate on filtered ToolBench and 73.8% on AnyToolBench, outperforming ToolLLM and plain GPT-4 variants. Costs: ~135k tokens and ~43 OpenAI calls per query on average.
Problem Statement
Finding the correct APIs among 16K+ real-world endpoints is hard. Prior systems train a dedicated retriever, yet they still miss relevant APIs and have no feedback mechanism to recover from a bad retrieval. The existing evaluation (ToolBench) can inflate pass rates by counting non-solvable queries as passes. What is needed is a practical, closed-loop agent that (1) scales search, (2) self-corrects, and (3) is evaluated realistically.
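The inflation effect described above can be illustrated with a small sketch. The record fields (`solvable`, `solved`) and the four sample queries are hypothetical and are not taken from ToolBench; the two scoring functions are simplified stand-ins for the original and revised protocols.

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    solvable: bool  # hypothetical label: can the query be solved at all?
    solved: bool    # did the agent produce an accepted solution?

def inflated_pass_rate(results):
    """Counts every non-solvable query as a pass (the behavior criticized above)."""
    passes = sum(1 for r in results if r.solved or not r.solvable)
    return passes / len(results)

def filtered_pass_rate(results):
    """Drops non-solvable queries first, then scores only the remainder."""
    kept = [r for r in results if r.solvable]
    return sum(1 for r in kept if r.solved) / len(kept)

results = [
    QueryResult(solvable=True, solved=True),
    QueryResult(solvable=True, solved=False),
    QueryResult(solvable=False, solved=False),
    QueryResult(solvable=False, solved=False),
]
print(inflated_pass_rate(results))  # 0.75
print(filtered_pass_rate(results))  # 0.5
```

With half of the toy queries non-solvable, the unfiltered score reports 0.75 even though the agent solved only one of the two queries it could have solved.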
Main Contribution
A plug-and-play agent that runs on GPT-4 function calling and requires no extra model training.
A three-tier hierarchical API retriever (meta → category → tool agents) that narrows search across 16K+ RapidAPI entries.
A solver using DFSDT (depth-first search decision tree) or Chain-of-Thought to call candidate APIs and return solutions.
A self-reflection loop that re-activates retriever and solver when solutions fail, expanding the candidate pool selectively.
A critique of ToolBench evaluation and the release of AnyToolBench plus a revised pass-rate protocol that filters non-solvable queries.
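The three-tier narrowing listed above can be sketched as follows. This is a toy illustration under stated assumptions: the pool contents and all names (`POOL`, `matches`, `meta_agent`, and so on) are hypothetical, and a naive shared-word check stands in for the GPT-4 relevance judgment each agent would actually make.

```python
# Toy pool: category -> tool -> {api_name: description}. All entries are hypothetical.
POOL = {
    "Weather": {
        "OpenSky": {
            "current_weather": "current weather for a city",
            "daily_forecast": "weather forecast for coming days",
        },
    },
    "Finance": {
        "StockPeek": {"quote": "latest stock price quote"},
        "FxNow": {"convert": "convert currency amounts"},
    },
}

def matches(query, text):
    """Stand-in for an LLM relevance judgment: any shared lowercase word."""
    return bool(set(query.lower().split()) & set(text.lower().split()))

def meta_agent(query):
    # Tier 1: keep categories that contain at least one relevant API.
    return [c for c, tools in POOL.items()
            if any(matches(query, d) for apis in tools.values() for d in apis.values())]

def category_agent(query, category):
    # Tier 2: keep tools in this category that contain a relevant API.
    return [t for t, apis in POOL[category].items()
            if any(matches(query, d) for d in apis.values())]

def tool_agent(query, category, tool):
    # Tier 3: keep the individual APIs that look relevant.
    return [f"{category}/{tool}/{name}"
            for name, desc in POOL[category][tool].items() if matches(query, desc)]

def retrieve(query):
    candidates = []
    for category in meta_agent(query):
        for tool in category_agent(query, category):
            candidates.extend(tool_agent(query, category, tool))
    return candidates

print(retrieve("weather forecast in Paris"))
# ['Weather/OpenSky/current_weather', 'Weather/OpenSky/daily_forecast']
```

Each tier only inspects the subtree selected by the tier above it, which is what keeps the per-agent search scope small.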
Key Findings
AnyTool substantially improves pass rates over ToolLLM and plain GPT-4 variants, averaging ~58.2% on filtered ToolBench.
AnyTool reaches 73.8% pass rate on the authors' AnyToolBench.
Self-reflection yields meaningful gains quickly.
Both the hierarchical retriever and self-reflection are critical.
Resource use per query is high: about 135k tokens and 43 OpenAI calls on average.
Results
Pass rate (filtered ToolBench, average): ~58.2%
Pass rate (AnyToolBench): 73.8%
Self-reflection benefit
Resource use per query: ~135k tokens, ~43 OpenAI calls (average)
Who Should Care
What To Try In 7 Days
Run AnyTool on a small internal API pool to compare solved-rate vs current pipeline.
Add a short self-reflection loop (4–6 cycles) to existing tool-call flows to boost success.
Measure token and API-call costs per query and set limits or batching to control spending.
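A bounded self-reflection loop of the kind suggested above might look like the sketch below. It is a minimal sketch, not AnyTool's implementation: `solve_with_reflection`, the `retrieve`/`solve` callables, and the stub functions are all hypothetical, and the stubs stand in for LLM-backed components.

```python
def solve_with_reflection(query, retrieve, solve, max_cycles=4):
    """Retry retrieval and solving until success or the cycle cap is hit."""
    candidates = []   # candidate pool grows across cycles
    feedback = None   # reason for the last failure, passed back to the retriever
    for cycle in range(1, max_cycles + 1):
        for api in retrieve(query, feedback):
            if api not in candidates:
                candidates.append(api)
        answer, feedback = solve(query, candidates)
        if answer is not None:
            return answer, cycle
    return None, max_cycles

# Hypothetical stubs: the first retrieval misses, the second one succeeds.
def fake_retrieve(query, feedback):
    return ["geo_lookup"] if feedback is None else ["weather_now"]

def fake_solve(query, candidates):
    if "weather_now" in candidates:
        return "sunny", None
    return None, "no candidate API returns weather data"

print(solve_with_reflection("weather in Oslo?", fake_retrieve, fake_solve))
# ('sunny', 2)
```

Returning the cycle count alongside the answer makes it easy to log how many reflection rounds each query needed when comparing against an existing pipeline.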
Agent Features
Memory
- per-agent historical context (short-term)
- global API-candidate pool
Planning
- DFSDT decision-tree backtracking
- Chain-of-Thought (CoT) option
- self-reflection-driven re-planning
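The backtracking idea behind DFSDT can be shown with a generic depth-first search. This is a plain DFS sketch under assumptions, not the DFSDT procedure itself: in a real solver an LLM would propose and judge the next steps, whereas here a hypothetical hard-coded `TREE` plays that role.

```python
def dfs_solve(state, expand, is_solution, max_depth=4):
    """Depth-first search with backtracking.

    expand(state) returns candidate next states in preference order;
    is_solution(state) says whether a state answers the query.
    Returns the first solution path found, or None if every branch fails.
    """
    if is_solution(state):
        return [state]
    if max_depth == 0:
        return None
    for nxt in expand(state):
        path = dfs_solve(nxt, expand, is_solution, max_depth - 1)
        if path is not None:
            return [state] + path
    return None  # dead end: the caller moves on to its next sibling

# Hypothetical decision tree: the first branch fails, the second succeeds.
TREE = {
    "start": ["call_search_api", "call_lookup_api"],
    "call_search_api": ["search_error"],
    "call_lookup_api": ["final_answer"],
}

path = dfs_solve("start", lambda s: TREE.get(s, []), lambda s: s == "final_answer")
print(path)  # ['start', 'call_lookup_api', 'final_answer']
```

The failed `call_search_api` branch is abandoned without restarting the whole attempt, which is the main difference from a single linear Chain-of-Thought trajectory.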
Tool Use
- GPT-4 function calling
- API selection and function invocation
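For function calling, each candidate API has to be described to the model as a tool definition. The sketch below builds one from a hypothetical API record; the record layout and the `to_tool_spec` helper are assumptions, and the output shape follows the commonly documented OpenAI `tools` format, which should be checked against the current API reference before use. No network call is made.

```python
import json

def to_tool_spec(api):
    """Convert a hypothetical API record into a function-calling tool definition."""
    return {
        "type": "function",
        "function": {
            "name": api["name"],
            "description": api["description"],
            "parameters": {
                "type": "object",
                "properties": {
                    p["name"]: {"type": p["type"], "description": p["description"]}
                    for p in api["params"]
                },
                "required": [p["name"] for p in api["params"] if p["required"]],
            },
        },
    }

# Hypothetical API record, not taken from the RapidAPI pool.
api = {
    "name": "get_current_weather",
    "description": "Return the current weather for a city.",
    "params": [
        {"name": "city", "type": "string", "description": "City name", "required": True},
        {"name": "units", "type": "string", "description": "metric or imperial", "required": False},
    ],
}

spec = to_tool_spec(api)
print(json.dumps(spec, indent=2))
```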
Frameworks
- GPT-4 function calling
- AutoGen-RAG (as a baseline comparator)
Is Agentic
true
Architectures
- hierarchical meta→category→tool agents
- single solver agent
Collaboration
- multi-threaded parallel agents
- divide-and-conquer category assignment
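Divide-and-conquer retrieval across parallel agents can be sketched with a thread pool. The catalog, the `category_agent` stub, and `parallel_retrieve` are hypothetical; a real category agent would issue LLM calls, which are I/O-bound and therefore a reasonable fit for threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical catalog: category -> API names.
CATALOG = {
    "Weather": ["weather_now", "weather_forecast"],
    "Finance": ["stock_quote"],
    "Sports": ["match_scores"],
}

def category_agent(category, query):
    """Stand-in for one category agent: keep APIs whose name contains a query word."""
    words = query.lower().split()
    return [api for api in CATALOG[category] if any(w in api for w in words)]

def parallel_retrieve(query, categories, max_workers=4):
    """Assign one category per task and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the merge is deterministic.
        results = pool.map(lambda c: category_agent(c, query), categories)
        return [api for apis in results for api in apis]

print(parallel_retrieve("weather and stock", list(CATALOG)))
# ['weather_now', 'weather_forecast', 'stock_quote']
```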
Optimization Features
Token Efficiency
- token limit enforced (200k tokens) to cap costs
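One way to enforce such a cap is a small budget tracker that every LLM call is charged against. The 200k default mirrors the limit stated above; the `TokenBudget` class itself is a hypothetical sketch, and how AnyTool actually enforces its limit is not described here. In practice the token count per call would come from the provider's usage report.

```python
class TokenBudget:
    """Tracks tokens and call count for one query and refuses to exceed the limit."""

    def __init__(self, limit=200_000):
        self.limit = limit
        self.used = 0
        self.calls = 0

    def charge(self, tokens):
        """Record one LLM call; raise if it would push usage past the limit."""
        if self.used + tokens > self.limit:
            raise RuntimeError(
                f"token budget exceeded: {self.used} used, {tokens} requested, limit {self.limit}"
            )
        self.used += tokens
        self.calls += 1

budget = TokenBudget(limit=10_000)  # small limit for demonstration
budget.charge(4_000)
budget.charge(4_000)
try:
    budget.charge(4_000)  # would reach 12,000 > 10,000
except RuntimeError as err:
    print("stopped:", err)
print(budget.used, budget.calls)  # 8000 2
```

Tracking `calls` next to `used` gives both cost figures reported for the system (tokens and OpenAI call count) from the same object.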
System Optimization
- multi-threaded agents for parallel retrieval
Training Optimization
- no additional model training required
Inference Optimization
- hierarchical narrowing to reduce per-agent search scope
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High token consumption and many OpenAI API calls per query (avg ~135k tokens, ~43 calls).
- Relies on closed-source GPT-4 function calling, so local or offline deployment is not supported.
- Benchmarks were filtered to remove non-solvable queries; real-world success may vary on noisier inputs.
- Not validated on very complex scenarios beyond the provided benchmarks.
When Not To Use
- When low-latency, low-cost, or on-device operation is required.
- If you cannot use GPT-4 function calling due to policy or cost limits.
- For trivially small API pools where simpler retrieval is cheaper.
Failure Modes
- Retriever selects irrelevant APIs, so GPT-4 judges the query non-solvable with them and the calls are wasted.
- Solver gives up ('Give Up') when required parameters or API behaviors are missing.
- Server instability and variable GPT-4 response times can stall pipelines.
- Potential judge bias if GPT-4 evaluation diverges from real user requirements.
Core Entities
Models
- GPT-4
- GPT-3.5
- ToolLLaMA
- ToolLLM
- AnyTool
Metrics
- pass rate
Datasets
- ToolBench (filtered)
- AnyToolBench
Benchmarks
- ToolBench
- AnyToolBench
Context Entities
Models
- text-embedding-ada-002
- all-mpnet-base-v2
Metrics
- token consumption
- OpenAI call count

