Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
63
Why It Matters For Business
If you build assistants that call external services, the recipe here (fine-tuning on many real APIs, adding an API retriever, and searching multiple reasoning paths) can cut manual integration work and brings an open-source model close to closed systems on tool-use tasks.
Summary TLDR
This paper builds ToolBench, an instruction-tuning dataset of 16,464 real REST APIs and 126,486 instruction→solution pairs that are auto-generated and annotated with ChatGPT. It proposes DFSDT, a depth-first decision-tree search that expands multiple reasoning traces, and a neural API retriever that picks relevant tools. Fine-tuning LLaMA-2 (7B) on ToolBench yields ToolLLaMA, which (with DFSDT) achieves pass rates nearly on par with ChatGPT on the authors' tool-use tests and generalizes to unseen APIs and to the APIBench benchmark. The project ships ToolEval (a ChatGPT-based automatic evaluator) and releases the dataset and code on GitHub.
Problem Statement
Open-source LLMs lack robust tool-use: existing instruction tuning focuses on language tasks and small or simulated tool sets. Real-world use needs (1) many diverse, working APIs, (2) multi-tool and multi-step reasoning, (3) automatic evaluation, and (4) retrieval over large API pools.
Main Contribution
ToolBench: an instruction tuning dataset built from RapidAPI with 16,464 real REST APIs and 126,486 instruction→solution pairs that include multi-tool, multi-step scenarios.
DFSDT: a depth-first search decision-tree method to expand and evaluate multiple reasoning traces during API-based problem solving.
ToolEval: an automatic ChatGPT-based evaluator reporting pass rate and win rate with high agreement to humans.
ToolLLaMA: a LLaMA-2 (7B) model fine-tuned on ToolBench plus a neural API retriever (Sentence-BERT based) to recommend APIs.
Key Findings
ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.
DFSDT substantially raises annotation pass rate over ReAct when both run on ChatGPT.
The neural API retriever strongly outperforms BM25 and OpenAI ada embeddings on retrieval.
ToolLLaMA (LLaMA‑2 7B fine-tuned on ToolBench) approaches ChatGPT performance on tool execution tasks.
ToolEval (ChatGPT-based) correlates well with human judgments.
ToolLLaMA generalizes to out-of-distribution APIBench domains without training on them.
Results
API retriever quality (avg NDCG@1)
DFSDT vs ReAct (avg pass rate)
ToolLLaMA main performance (avg pass rate / win rate)
ToolLLaMA with retriever (avg pass / win)
ToolEval agreement with humans
Accuracy
Who Should Care
What To Try In 7 Days
Run the ToolBench repo and inspect API docs for 10 target APIs relevant to your product.
Train or adapt a dense retriever on a small API doc set and measure NDCG@1/@5 on sample queries.
Prototype DFSDT-style decision branching for a 2–3 step workflow and compare pass rates to a single-trace method.
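For the retriever experiment above, NDCG@k with binary relevance can be computed without extra dependencies. The sketch below is illustrative only: the API identifiers are made up, and it assumes each ranked list contains no duplicates and that an API is either relevant to a query or not.

```python
import math
from typing import List, Set


def ndcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """NDCG@k with binary relevance: gain 1 for a relevant API, else 0."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, api_id in enumerate(ranked_ids[:k])
        if api_id in relevant_ids
    )
    # Ideal ranking puts every relevant API first.
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: the one relevant API is ranked second.
ranked = ["news.search", "weather.forecast", "maps.geocode"]
relevant = {"weather.forecast"}

score_at_1 = ndcg_at_k(ranked, relevant, k=1)  # 0.0, top hit is irrelevant
score_at_5 = ndcg_at_k(ranked, relevant, k=5)  # 1 / log2(3), about 0.63
```

Averaging these scores over a set of sample queries gives the avg NDCG@1 and NDCG@5 figures to compare against a BM25 baseline.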
Agent Features
Memory
- short-term state via multi-round conversation (no long-term memory)
Planning
- DFSDT decision-tree (depth-first variant)
- ReAct (baseline)
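The difference between the two planners can be shown with a toy example: a single-trace baseline commits to its first proposal at each step, while a depth-first decision-tree search backtracks to a sibling when a branch dead-ends. This is a minimal sketch, not the authors' implementation: `toy_propose` and `toy_is_solution` are hypothetical stand-ins for LLM calls, the API names are invented, and the node budget is an assumed mechanism for capping cost.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[str, ...]  # hypothetical: the sequence of API calls made so far


def dfs_decision_tree(
    state: State,
    propose_actions: Callable[[State], List[str]],
    is_solution: Callable[[State], bool],
    max_depth: int,
    budget: List[int],
) -> Optional[State]:
    """Pre-order depth-first search over candidate action sequences.

    budget is a one-element list holding the remaining number of node
    expansions, so that all recursive calls share one counter.
    """
    if is_solution(state):
        return state
    if len(state) >= max_depth or budget[0] <= 0:
        return None  # dead end: the caller backtracks to the next sibling
    budget[0] -= 1
    for action in propose_actions(state):
        found = dfs_decision_tree(
            state + (action,), propose_actions, is_solution, max_depth, budget
        )
        if found is not None:
            return found
    return None


# Toy stand-ins for the LLM: the first proposal at each step is a dead end.
def toy_propose(state: State) -> List[str]:
    if not state:
        return ["search_hotels", "search_flights"]
    if state == ("search_flights",):
        return ["book_hotel", "book_flight"]
    return []


def toy_is_solution(state: State) -> bool:
    return state == ("search_flights", "book_flight")


def single_trace(state: State = ()) -> Optional[State]:
    """Baseline: always follow the first proposal and never backtrack."""
    while not toy_is_solution(state):
        actions = toy_propose(state)
        if not actions:
            return None
        state = state + (actions[0],)
    return state


dfs_result = dfs_decision_tree((), toy_propose, toy_is_solution, 3, [20])
greedy_result = single_trace()
```

Here `greedy_result` is `None` because the first branch dead-ends, while `dfs_result` recovers the working path after two backtracks. The shared budget makes the cost trade-off explicit: every extra branch explored is another model call.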
Tool Use
- Function calling of APIs (multi-round)
- Multi-tool orchestration
Frameworks
- ToolEval
- API retriever
- RapidAPI-based tool registry
Is Agentic
true
Architectures
- LLaMA-2 7B (fine-tuned)
Optimization Features
Token Efficiency
- API responses compressed with ChatGPT-derived heuristics and capped at 1024 tokens
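One way to picture this step is "drop low-value fields, then cap the length". The sketch below makes two loud assumptions: the set of keys to drop is supplied by hand (in the source it comes from ChatGPT-based heuristics), and whitespace splitting stands in for a real model tokenizer. The function names and sample payload are hypothetical.

```python
import json
from typing import Any, Set

MAX_TOKENS = 1024  # limit stated in the text above


def drop_keys(value: Any, unimportant: Set[str]) -> Any:
    """Recursively remove keys judged unimportant from a JSON-like value."""
    if isinstance(value, dict):
        return {
            key: drop_keys(item, unimportant)
            for key, item in value.items()
            if key not in unimportant
        }
    if isinstance(value, list):
        return [drop_keys(item, unimportant) for item in value]
    return value


def compress_response(
    response: Any, unimportant: Set[str], max_tokens: int = MAX_TOKENS
) -> str:
    """Drop unimportant keys, serialize, then keep the first max_tokens tokens.

    Whitespace splitting is an assumption standing in for a real tokenizer.
    """
    text = json.dumps(drop_keys(response, unimportant))
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Hypothetical API response with bookkeeping fields the model does not need.
raw = {
    "city": "Paris",
    "temp_c": 21,
    "request_id": "abc",
    "links": {"self": "https://example.com"},
}
compressed = compress_response(raw, unimportant={"request_id", "links"})
```

Note that hard truncation can cut a JSON document mid-structure, so the model may see an incomplete payload when a response exceeds the cap.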
Training Optimization
- Instruction tuning on solution paths
- Extended context via positional interpolation to 8192 tokens
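Positional interpolation rescales position indices so that a longer window maps back into the range the model saw during pretraining. The sketch below assumes rotary position embeddings and an original context of 4096 tokens (LLaMA-2's published context length); the function names and the small `head_dim` are illustrative.

```python
from typing import List

ORIGINAL_CONTEXT = 4096  # assumption: LLaMA-2's pretraining context length
EXTENDED_CONTEXT = 8192  # target length stated above


def interpolated_position(position: int) -> float:
    """Rescale a position in the extended window into the original range."""
    return position * ORIGINAL_CONTEXT / EXTENDED_CONTEXT


def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angles for one position under rotary position embeddings."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# The last token of the extended window lands just inside the original range.
last = interpolated_position(EXTENDED_CONTEXT - 1)  # 4095.5
angles = rope_angles(last, head_dim=8)
```

With a scale factor of 0.5, neighbouring tokens sit half a position apart, which is why fine-tuning on long sequences is still needed after the rescaling.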
Inference Optimization
- DFSDT pre-order traversal to reduce evaluator calls
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Data and annotations are generated and validated using ChatGPT, so dataset quality may inherit biases or failure modes of that model.
- ToolEval is ChatGPT-based and, while correlated with human judgments, can still reflect evaluator biases (see Appendix A.5 of the paper).
- APIs on RapidAPI change over time; dataset relies on initial filtering and compressed responses that may become stale.
- DFSDT increases API-call costs during annotation and requires careful budget tuning.
- ToolLLaMA is fine-tuned on LLaMA‑2 7B and may not scale identically to much larger or smaller models.
When Not To Use
- When you have no stable, documented APIs to expose to the model.
- In ultra-low-latency settings where multi-round API calls are unacceptable.
- If you cannot afford annotation or runtime costs for multi-path search during deployment.
Failure Modes
- Model hallucination: inventing API outputs or final answers despite API evidence.
- API drift or broken endpoints leading to failed solution paths after deployment.
- Retriever misses the correct API, causing chain failures.
- Excessive or redundant API calls increasing cost without progress.
Core Entities
Models
- LLaMA-2 7B
- ToolLLaMA (fine-tuned LLaMA-2 7B)
- ChatGPT (gpt-3.5-turbo-16k)
- GPT-4
- Text-Davinci-003
- Claude-2
- Vicuna
- Alpaca
- Gorilla
Metrics
- Pass rate
- Win rate
- NDCG@1
- NDCG@5
- Accuracy
- Hallucination rate
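The two headline metrics reduce to simple proportions once per-instruction verdicts are available. This sketch uses simplified definitions as assumptions: pass rate is the fraction of instructions judged solved, and win rate is the fraction of pairwise comparisons in which the candidate solution is preferred over a reference. Counting a tie as half a win is a choice made here, not something stated in the source, and the verdict labels are hypothetical.

```python
from typing import List


def pass_rate(passed: List[bool]) -> float:
    """Fraction of instructions whose solution path was judged successful."""
    return sum(passed) / len(passed) if passed else 0.0


def win_rate(preferences: List[str]) -> float:
    """Fraction of comparisons won by the candidate.

    Each entry is "candidate", "reference", or "tie"; a tie counts as half
    a win (an assumption of this sketch).
    """
    if not preferences:
        return 0.0
    score = sum(
        1.0 if verdict == "candidate" else 0.5 if verdict == "tie" else 0.0
        for verdict in preferences
    )
    return score / len(preferences)


# Hypothetical evaluator verdicts for four instructions.
example_pass = pass_rate([True, True, False, True])
example_win = win_rate(["candidate", "reference", "tie", "candidate"])
```

In this example `example_pass` is 0.75 and `example_win` is 0.625.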
Datasets
- ToolBench
- APIBench
Benchmarks
- ToolBench evaluation set
- APIBench

