ToolBench + DFSDT + retriever teach LLaMA-2 to use 16k+ real REST APIs with ChatGPT-based annotation and evaluation

July 31, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

The work gives a practical dataset and methods with quantitative ablations and cross-domain tests; results are convincing but depend on ChatGPT for data and evaluation, and operational issues (latency, API drift) remain.

Citations: 63

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build assistants that call external services, training on many real APIs, combined with an API retriever and multi-path planning, can reduce hand-written tool integration and, on the authors' tests, brings an open-source 7B model close to ChatGPT's pass rate.

Summary TLDR

This paper builds ToolBench, an instruction-tuning dataset of 16,464 real REST APIs and 126,486 instruction→solution pairs auto‑generated and annotated with ChatGPT. It proposes DFSDT, a depth-first decision-tree search that expands reasoning traces, and a neural API retriever to pick relevant tools. Fine‑tuning LLaMA‑2 (7B) on ToolBench yields ToolLLaMA, which (with DFSDT) achieves pass rates nearly on par with ChatGPT on the authors' tool-use tests and generalizes to unseen APIs and the APIBench benchmark. The project ships ToolEval (ChatGPT-based automatic evaluator) and the dataset/code on GitHub.

Problem Statement

Open-source LLMs lack robust tool-use: existing instruction tuning focuses on language tasks and small or simulated tool sets. Real-world use needs (1) many diverse, working APIs, (2) multi-tool and multi-step reasoning, (3) automatic evaluation, and (4) retrieval over large API pools.

Main Contribution

ToolBench: an instruction-tuning dataset built from RapidAPI, with 16,464 real REST APIs and 126,486 instruction→solution pairs that include multi-tool, multi-step scenarios.

DFSDT: a depth-first search decision-tree method to expand and evaluate multiple reasoning traces during API-based problem solving.
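
The idea of searching a decision tree depth-first, backing up to try a sibling branch when one trace fails, can be illustrated with a minimal sketch. This is not the authors' implementation: `propose` and `is_solved` are hypothetical stand-ins for the LLM's action proposals and the answer check, and the call budget is an assumption added to keep the toy search bounded.

```python
from typing import Callable, List, Optional, Sequence


def dfs_decision_tree(
    state: List[str],
    propose: Callable[[List[str]], Sequence[str]],
    is_solved: Callable[[List[str]], bool],
    max_depth: int,
    budget: List[int],
) -> Optional[List[str]]:
    """Return the first action sequence that solves the task, or None.

    `budget` is a one-element list so recursive calls share one counter.
    """
    if is_solved(state):
        return state
    if len(state) >= max_depth or budget[0] <= 0:
        return None
    for action in propose(state):
        if budget[0] <= 0:
            break
        budget[0] -= 1  # each expansion stands for one (simulated) API call
        result = dfs_decision_tree(
            state + [action], propose, is_solved, max_depth, budget
        )
        if result is not None:
            return result
        # This branch failed: fall through and try the next sibling.
    return None


# Toy task (hypothetical): only the sequence call_b -> call_a succeeds.
def toy_propose(state: List[str]) -> Sequence[str]:
    return ["call_a", "call_b"]


def toy_is_solved(state: List[str]) -> bool:
    return state == ["call_b", "call_a"]


budget = [10]
path = dfs_decision_tree([], toy_propose, toy_is_solved, max_depth=2, budget=budget)
```

A single-trace method that always follows the first proposal would stop at `call_a -> call_a` in this toy task; the depth-first search abandons that subtree and reaches the solving path through the second root action.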

Key Findings

ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.

Numbers: 16,464 APIs; 126,486 instances; 469,585 real API calls

Practical Use: You can train models on a large, realistic API population so they learn diverse, multi-step tool workflows rather than toy single-tool cases.

Evidence Ref: Table 1, §2.1, §2.3

DFSDT greatly improves automated annotation success over ReACT on ChatGPT.

Numbers: DFSDT avg pass rate 63.8% vs ReACT 35.3% (Table 3)

Practical Use: Use DFSDT to find more valid multi-step API solution paths with the same budget of API calls when generating training data.

Evidence Ref: Table 3; §2.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
API retriever quality (avg NDCG@1) | 78.0 | OpenAI ada 49.6; BM25 18.5 | ≈+28.4 vs Ada | ToolBench retrieval eval (I1/I2/I3) | Table 2; §3.1 | Table 2
DFSDT vs ReACT (avg pass rate) | 63.8% (DFSDT) | 35.3% (ReACT) | +28.5 pp | ToolBench annotation (ChatGPT as agent) | Table 3; §3.1 | Table 3
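
For readers unfamiliar with the retrieval metric in the first row, here is a minimal sketch of NDCG@k. It assumes the linear-gain form of DCG (identical to the exponential form when relevance is binary) and assumes `ranked_relevances` lists a relevance label for every candidate API in the order the retriever ranked them; the paper's exact evaluation script may differ.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (rank 1 has discount 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending-relevance) ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query: the retriever ranked a relevant API first, an irrelevant
# one second, and another relevant one third.
score_at_1 = ndcg_at_k([1, 0, 1], k=1)
score_at_3 = ndcg_at_k([1, 0, 1], k=3)
```

NDCG@1 only asks whether the top-ranked API is relevant, so it is the stricter of the two cutoffs commonly reported; averaging the per-query score over a query set gives a figure comparable in kind to the table's.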

What To Try In 7 Days

Run the ToolBench repo and inspect API docs for 10 target APIs relevant to your product.

Train or adapt a dense retriever on a small API doc set and measure NDCG@1/@5 on sample queries.

Prototype DFSDT-style decision branching for a 2–3 step workflow and compare pass rates to a single-trace method.
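
For the comparison in the last step, a small tally of pass rates and their difference in percentage points may be enough to start. The outcome lists below are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks for which the method produced a valid solution."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def delta_pp(candidate: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Difference in pass rate, in percentage points (candidate minus baseline)."""
    return pass_rate(candidate) - pass_rate(baseline)


# Placeholder outcomes for the same four tasks under each method.
single_trace = [True, False, False, False]
branching = [True, True, False, True]

improvement = delta_pp(branching, single_trace)
```

Running both methods on the same task set and with the same cap on API calls keeps the comparison fair.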

Agent Features

Memory
short-term state via multi-round conversation (no long-term memory)
Planning
DFSDT decision-tree (depth-first variant), ReACT (baseline)
Tool Use
Function calling of APIs (multi-round), Multi-tool orchestration
Frameworks
ToolEval, API retriever, RapidAPI-based tool registry
Is Agentic

Yes

Architectures
LLaMA-2 7B (fine-tuned)

Optimization Features

Token Efficiency
API response compression to 1024 tokens using ChatGPT heuristics
Training Optimization
Instruction tuning on solution paths, Extended context via positional interpolation to 8192 tokens
Inference Optimization
DFSDT pre-order traversal to reduce evaluator calls
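
The token-efficiency item above (capping API responses at 1024 tokens) can be approximated with a simple prune-then-truncate step. This is a sketch under stated assumptions, not the authors' pipeline: `keep_keys` is a hypothetical stand-in for whatever fields are judged important, and whitespace splitting is a crude proxy for a real tokenizer.

```python
import json
from typing import Any, Dict, Iterable


def compress_response(
    response: Dict[str, Any],
    keep_keys: Iterable[str],
    max_tokens: int = 1024,
) -> str:
    """Drop unlisted top-level fields, then truncate to a token budget."""
    keep = set(keep_keys)
    pruned = {key: value for key, value in response.items() if key in keep}
    text = json.dumps(pruned, ensure_ascii=False)
    tokens = text.split()  # assumption: whitespace tokens approximate model tokens
    if len(tokens) > max_tokens:
        text = " ".join(tokens[:max_tokens])  # may cut the JSON mid-structure
    return text


# Hypothetical API response with one useful field and one noisy field.
raw = {"title": "ok", "debug": "x" * 50}
compact = compress_response(raw, keep_keys=["title"])
```

Truncation can leave invalid JSON, which is acceptable when the text is only shown to the model but not when downstream code needs to parse it.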

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Data and annotations are generated and validated using ChatGPT, so dataset quality may inherit biases or failure modes of that model.

ToolEval is ChatGPT-based and, while correlated with humans, can still reflect evaluator biases (A.5).

When Not To Use

When you have no stable, documented APIs to expose to the model.

In ultra-low-latency settings where multi-round API calls are unacceptable.

Failure Modes

Model hallucination: inventing API outputs or final answers despite API evidence.

API drift or broken endpoints leading to failed solution paths after deployment.

Core Entities

Models

LLaMA-2 7B, ToolLLaMA (fine-tuned LLaMA-2 7B), ChatGPT (gpt-3.5-turbo-16k), GPT-4, Text-Davinci-003, Claude-2, Vicuna, Alpaca, Gorilla

Metrics

Pass rate, Win rate, NDCG@1, NDCG@5, Accuracy, Hallucination rate

Datasets

ToolBench, APIBench

Benchmarks

ToolBench evaluation set, APIBench