ToolBench + DFSDT + retriever teach LLaMA-2 to use 16k+ real REST APIs with ChatGPT-based annotation and evaluation

July 31, 20239 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

63

Authors

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF

Why It Matters For Business

If you build assistants that call external services, training on many real APIs plus a retriever and multi-path planning dramatically reduces manual engineering and makes open-source models practically competitive with closed systems.

Summary TLDR

This paper builds ToolBench, an instruction-tuning dataset of 16,464 real REST APIs and 126,486 instruction→solution pairs auto‑generated and annotated with ChatGPT. It proposes DFSDT, a depth-first decision-tree search that expands reasoning traces, and a neural API retriever to pick relevant tools. Fine‑tuning LLaMA‑2 (7B) on ToolBench yields ToolLLaMA, which (with DFSDT) achieves pass rates nearly on par with ChatGPT on the authors' tool-use tests and generalizes to unseen APIs and the APIBench benchmark. The project ships ToolEval (ChatGPT-based automatic evaluator) and the dataset/code on GitHub.

Problem Statement

Open-source LLMs lack robust tool-use: existing instruction tuning focuses on language tasks and small or simulated tool sets. Real-world use needs (1) many diverse, working APIs, (2) multi-tool and multi-step reasoning, (3) automatic evaluation, and (4) retrieval over large API pools.

Main Contribution

ToolBench: an instruction tuning dataset built from RapidAPI with 16,464 real REST APIs and 126,486 instruction→solution pairs that include multi-tool, multi-step scenarios.

DFSDT: a depth-first search decision-tree method to expand and evaluate multiple reasoning traces during API-based problem solving.

ToolEval: an automatic ChatGPT-based evaluator reporting pass rate and win rate with high agreement to humans.

ToolLLaMA: a LLaMA-2 (7B) model fine-tuned on ToolBench plus a neural API retriever (Sentence-BERT based) to recommend APIs.

Key Findings

ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.

Numbers16,464 APIs; 126,486 instances; 469,585 real API calls

DFSDT greatly improves automated annotation success over ReACT on ChatGPT.

NumbersDFSDT avg pass rate 63.8% vs ReACT 35.3% (Table 3)

The neural API retriever strongly outperforms BM25 and OpenAI ada embeddings on retrieval.

NumbersAverage NDCG@1 78.0 vs Ada 49.6 and BM25 18.5 (Table 2)

ToolLLaMA (LLaMA‑2 7B fine-tuned on ToolBench) approaches ChatGPT performance on tool execution tasks.

NumbersToolLLaMA DFSDT avg pass 66.7% and win 60.0%; with retriever avg pass 67.3% win 63.1% (Table 4)

ToolEval (ChatGPT-based) correlates well with human judgments.

NumbersAgreement: pass rate 87.1%, win rate 80.3% (A.5)

ToolLLaMA generalizes to out-of-distribution APIBench domains without training on them.

NumbersToolLLaMA+retriever AST accuracy: TorchHub 51.16%, TensorHub 40.59%, HuggingFace 16.77% (Table 5)

Results

API retriever quality (avg NDCG@1)

Value78.0

BaselineOpenAI ada 49.6; BM25 18.5

DFSDT vs ReACT (avg pass rate)

Value63.8% (DFSDT)

Baseline35.3% (ReACT)

ToolLLaMA main performance (avg pass rate / win rate)

Value66.7% pass, 60.0% win (DFSDT, oracle APIs)

BaselineChatGPT DFSDT 64.8% pass avg, GPT-4 DFSDT 71.1% pass avg

ToolLLaMA with retriever (avg pass / win)

Value67.3% pass, 63.1% win

BaselineToolLLaMA with oracle APIs 66.7%/60.0%

ToolEval agreement with humans

ValuePass 87.1%, Win 80.3%

Accuracy

ValueTorchHub 51.16%; TensorHub 40.59%; HuggingFace 16.77%

BaselineGorilla-ZS+BM25 TorchHub 44.62%; HuggingFace 10.51%

Who Should Care

What To Try In 7 Days

Run the ToolBench repo and inspect API docs for 10 target APIs relevant to your product.

Train or adapt a dense retriever on a small API doc set and measure NDCG@1/@5 on sample queries.

Prototype DFSDT-style decision branching for a 2–3 step workflow and compare pass rates to a single-trace method.

Agent Features

Memory

  • short-term state via multi-round conversation (no long-term memory)

Planning

  • DFSDT decision-tree (depth-first variant)
  • ReACT (baseline)

Tool Use

  • Function calling of APIs (multi-round)
  • Multi-tool orchestration

Frameworks

  • ToolEval
  • API retriever
  • RapidAPI-based tool registry

Is Agentic

true

Architectures

  • LLaMA-2 7B (fine-tuned)

Optimization Features

Token Efficiency

  • API response compression to 1024 tokens using ChatGPT heuristics

Training Optimization

  • Instruction tuning on solution paths
  • Extended context via positional interpolation to 8192 tokens

Inference Optimization

  • DFSDT pre-order traversal to reduce evaluator calls

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Data and annotations are generated and validated using ChatGPT, so dataset quality may inherit biases or failure modes of that model.
  • ToolEval is ChatGPT-based and, while correlated with humans, can still reflect evaluator biases (A.5).
  • APIs on RapidAPI change over time; dataset relies on initial filtering and compressed responses that may become stale.
  • DFSDT increases API-call costs during annotation and requires careful budget tuning.
  • ToolLLaMA is fine-tuned on LLaMA‑2 7B and may not scale identically to much larger or smaller models.

When Not To Use

  • When you have no stable, documented APIs to expose to the model.
  • In ultra-low-latency settings where multi-round API calls are unacceptable.
  • If you cannot afford annotation or runtime costs for multi-path search during deployment.

Failure Modes

  • Model hallucination: inventing API outputs or final answers despite API evidence.
  • API drift or broken endpoints leading to failed solution paths after deployment.
  • Retriever misses the correct API, causing chain failures.
  • Excessive or redundant API calls increasing cost without progress.

Core Entities

Models

  • LLaMA-2 7B
  • ToolLLaMA (fine-tuned LLaMA-2 7B)
  • ChatGPT (gpt-3.5-turbo-16k)
  • GPT-4
  • Text-Davinci-003
  • Claude-2
  • Vicuna
  • Alpaca
  • Gorilla

Metrics

  • Pass rate
  • Win rate
  • NDCG@1
  • NDCG@5
  • Accuracy
  • Hallucination rate

Datasets

  • ToolBench
  • APIBench

Benchmarks

  • ToolBench evaluation set
  • APIBench