ToolBench + DFSDT + retriever teach LLaMA-2 to use 16k+ real REST APIs with ChatGPT-based annotation and evaluation

Overview

Decision Snapshot: Needs Validation

The work gives a practical dataset and methods with quantitative ablations and cross-domain tests; results are convincing but depend on ChatGPT for data and evaluation, and operational issues (latency, API drift) remain.

Citations: 63

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build assistants that call external services, training on many real APIs plus a retriever and multi-path planning can cut manual integration engineering and bring open-source models close to ChatGPT on the authors' tool-use tests.

Who Should Care

ML Engineer, Product Manager, Founder, Engineering Lead

Summary TLDR

This paper builds ToolBench, an instruction-tuning dataset of 16,464 real REST APIs and 126,486 instruction→solution pairs auto‑generated and annotated with ChatGPT. It proposes DFSDT, a depth-first decision-tree search that expands reasoning traces, and a neural API retriever to pick relevant tools. Fine‑tuning LLaMA‑2 (7B) on ToolBench yields ToolLLaMA, which (with DFSDT) achieves pass rates nearly on par with ChatGPT on the authors' tool-use tests and generalizes to unseen APIs and the APIBench benchmark. The project ships ToolEval (ChatGPT-based automatic evaluator) and the dataset/code on GitHub.

Problem Statement

Open-source LLMs lack robust tool-use: existing instruction tuning focuses on language tasks and small or simulated tool sets. Real-world use needs (1) many diverse, working APIs, (2) multi-tool and multi-step reasoning, (3) automatic evaluation, and (4) retrieval over large API pools.

Main Contribution

ToolBench: an instruction tuning dataset built from RapidAPI with 16,464 real REST APIs and 126,486 instruction→solution pairs that include multi-tool, multi-step scenarios.

DFSDT: a depth-first search decision-tree method to expand and evaluate multiple reasoning traces during API-based problem solving.
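The depth-first idea behind DFSDT can be illustrated with a minimal sketch. This is not the authors' implementation: the `expand` and `is_solution` callbacks, the tuple-of-actions state, and the toy action names are all hypothetical stand-ins for what would be LLM-proposed API calls and a success check in a real agent.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical state: the tuple of action names taken so far.
State = Tuple[str, ...]


def dfs_decision_tree(
    state: State,
    expand: Callable[[State], Sequence[State]],
    is_solution: Callable[[State], bool],
    max_depth: int,
) -> Optional[State]:
    """Depth-first search with backtracking; returns the first solved state or None."""
    if is_solution(state):
        return state
    if len(state) >= max_depth:
        return None  # dead end: the caller backtracks to the next sibling
    for child in expand(state):
        found = dfs_decision_tree(child, expand, is_solution, max_depth)
        if found is not None:
            return found
    return None


def toy_expand(state: State) -> Sequence[State]:
    # Assumption: a real agent would ask an LLM to propose next API calls here.
    actions = ("get_weather", "search_flights", "book_flight")
    return [state + (action,) for action in actions]


def toy_is_solution(state: State) -> bool:
    # Assumption: a real check would inspect API responses, not action names.
    return state == ("search_flights", "book_flight")


path = dfs_decision_tree((), toy_expand, toy_is_solution, max_depth=2)
print(path)
```

In the toy run, the search first exhausts the `get_weather` branch, backtracks, and then finds a valid path under `search_flights`. A single-trace method that committed to the first branch would have failed.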

Key Findings

ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.

Numbers: 16,464 APIs; 126,486 instances; 469,585 real API calls

Practical Use: You can train models on a large, realistic API population so they learn diverse, multi-step tool workflows rather than toy single-tool cases.

Evidence Ref: Table 1, §2.1, §2.3

DFSDT greatly improves automated annotation success over ReACT on ChatGPT.

Numbers: DFSDT avg pass rate 63.8% vs ReACT 35.3% (Table 3)

Practical Use: Use DFSDT to find more valid multi-step API solution paths with the same budget of API calls when generating training data.

Evidence Ref: Table 3; §2.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
API retriever quality (avg NDCG@1)	78.0	OpenAI ada 49.6; BM25 18.5	+28.4 vs ada; +59.5 vs BM25	ToolBench retrieval eval (I1/I2/I3)	Table 2; §3.1	Table 2
DFSDT vs ReACT (avg pass rate)	63.8% (DFSDT)	35.3% (ReACT)	+28.5 pp	ToolBench annotation (ChatGPT as agent)	Table 3; §3.1	Table 3
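The deltas in the table follow directly from the reported values. The short check below recomputes them; the variable names are illustrative and the numbers are copied from the table rows above.

```python
# Values copied from the Results table; variable names are illustrative.
retriever_ndcg1 = 78.0
ada_ndcg1 = 49.6
bm25_ndcg1 = 18.5
dfsdt_pass_rate = 63.8
react_pass_rate = 35.3

# Round to one decimal to avoid floating-point noise in the differences.
delta_vs_ada = round(retriever_ndcg1 - ada_ndcg1, 1)
delta_vs_bm25 = round(retriever_ndcg1 - bm25_ndcg1, 1)
delta_pass_rate = round(dfsdt_pass_rate - react_pass_rate, 1)

print(delta_vs_ada, delta_vs_bm25, delta_pass_rate)
```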

What To Try In 7 Days

Run the ToolBench repo and inspect API docs for 10 target APIs relevant to your product.

Train or adapt a dense retriever on a small API doc set and measure NDCG@1/@5 on sample queries.

Prototype DFSDT-style decision branching for a 2–3 step workflow and compare pass rates to a single-trace method.
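For the retriever step, NDCG@k can be computed with a few lines of code. The sketch below uses the linear-gain form of DCG and assumes each query comes with a relevance score for every candidate API, listed in the order the retriever ranked them; the function names are hypothetical and are not taken from the ToolBench code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """NDCG@k, assuming ranked_relevances covers all candidates in ranked order."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant candidate exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# Toy query: the only relevant API doc was ranked second out of three.
scores = [0.0, 1.0, 0.0]
print(ndcg_at_k(scores, 1), ndcg_at_k(scores, 5))
```

Averaging `ndcg_at_k` over a set of sample queries gives a figure comparable in form to the NDCG@1 and NDCG@5 metrics listed on this page, though the absolute values depend on your own API docs and labels.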

Agent Features

Memory

short-term state via multi-round conversation (no long-term memory)

Planning

DFSDT decision-tree (depth-first variant), ReACT (baseline)

Tool Use

Function calling of APIs (multi-round), Multi-tool orchestration

Frameworks

ToolEval, API retriever, RapidAPI-based tool registry

Is Agentic

Yes

Architectures

LLaMA-2 7B (fine-tuned)

Optimization Features

Token Efficiency

API response compression to 1024 tokens using ChatGPT heuristics

Training Optimization

Instruction tuning on solution paths, Extended context via positional interpolation to 8192 tokens

Inference Optimization

DFSDT pre-order traversal to reduce evaluator calls

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/OpenBMB/ToolBench

Data URLs

https://github.com/OpenBMB/ToolBench

Risks & Boundaries

Limitations

Data and annotations are generated and validated using ChatGPT, so dataset quality may inherit biases or failure modes of that model.

ToolEval is ChatGPT-based and, while correlated with humans, can still reflect evaluator biases (A.5).

When Not To Use

When you have no stable, documented APIs to expose to the model.

In ultra-low-latency settings where multi-round API calls are unacceptable.

Failure Modes

Model hallucination: inventing API outputs or final answers despite API evidence.

API drift or broken endpoints leading to failed solution paths after deployment.

Core Entities

Models

LLaMA-2 7B, ToolLLaMA (fine-tuned LLaMA-2 7B), ChatGPT (gpt-3.5-turbo-16k), GPT-4, Text-Davinci-003, Claude-2, Vicuna, Alpaca, Gorilla

Metrics

Pass rate, Win rate, NDCG@1, NDCG@5, Accuracy, Hallucination rate

Datasets

ToolBench, APIBench

Benchmarks

ToolBench evaluation set, APIBench

