API-Bank: a large, runnable benchmark and training set to measure and improve LLMs' API/tool use; includes Lynx, a fine-tuned model.

April 14, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

12

Authors

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li

Why It Matters For Business

API-Bank lets product and engineering teams measure how reliably models call real APIs, cheaply create high-quality tool-use training data, and reduce costly manual labeling. Improving API reliability cuts user-facing errors like failed requests or wrong side effects.

Summary TLDR

API-Bank is a runnable benchmark and training corpus for measuring and improving how LLMs plan, retrieve, and call APIs. It provides an evaluation system with 73 implemented APIs and 314 manually annotated dialogues, a 1,888-dialogue training set produced by a five-agent data generator, and a fine-tuned model (Lynx-7B) that improves API-call accuracy over Alpaca by ~26 points but still trails GPT-4.

Problem Statement

Current LLMs can use external tools, but we lack a comprehensive, realistic benchmark and training data to measure and improve three skills: calling known APIs, retrieving which API to call, and planning multi-step API calls.

Main Contribution

API-Bank evaluation system: 73 runnable APIs, 314 human-annotated dialogues, 753 API calls for authentic testing.

Large training corpus: 1,888 dialogues, 2,138 APIs, produced with a five-agent multi-agent generator that cuts annotation cost by ~98%.

Lynx, a model fine-tuned from Alpaca-7B, and an open benchmark for comparing LLMs on Call, Retrieve+Call, and Plan+Retrieve+Call abilities, plus an error analysis highlighting the main failure modes.

Key Findings

API-Bank provides an executable evaluation set: 73 APIs, 314 manually reviewed dialogues, 753 API calls.

Numbers: 73 APIs; 314 dialogues; 753 API calls

Multi-agent generation reduced per-dialogue annotation cost from $8 to $0.1, a 98% cost saving, while producing high-quality data.

Numbers: $8 → $0.1 per dialogue; 98% cost reduction
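
The saving follows directly from the two per-dialogue costs quoted above. A quick check (variable names are illustrative, and the corpus-level totals are simple extrapolations, not figures from the paper) suggests the exact reduction is 98.75%, which the summary reports as 98%:

```python
# Per-dialogue annotation costs quoted above, in US dollars.
manual_cost = 8.0
multi_agent_cost = 0.1

# Relative saving = (old - new) / old.
saving = (manual_cost - multi_agent_cost) / manual_cost
print(f"{saving:.2%}")

# Illustrative extrapolation only: total cost of the 1,888 training
# dialogues under each per-dialogue price.
n_dialogues = 1888
manual_total = manual_cost * n_dialogues
multi_agent_total = multi_agent_cost * n_dialogues
print(f"manual: ${manual_total:,.2f}, multi-agent: ${multi_agent_total:,.2f}")
```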

Fine-tuning on API-Bank lifts Alpaca-7B's API call correctness by ~26 percentage points (24.06% → 49.87%), bringing it closer to GPT-3.5 but still ~21 points behind GPT-4 on overall correctness.

Numbers: Alpaca Call 24.06% → Lynx Call 49.87% (+25.8 pts); Lynx total 39.58% vs GPT-4 60.24% (−20.66 pts)

Base GPT-3 Davinci shows almost no API-usage ability on this benchmark (overall correctness ~0.57%).

Numbers: GPT-3 Davinci overall correctness 0.57%

Main failure modes across models are retrieval failure, API hallucination, wrong/invalid input parameters, and unparseable API call formats.

Numbers (examples): GPT-4 failed API retrieval 67.86%; Lynx API hallucination 61.38%; Alpaca 'No API Call' 36.77%

Results

Overall API-call correctness (zero-shot or fine-tuned, depending on the model)

Value: GPT-4 60.24%; GPT-3.5 47.16%; Lynx-7B 39.58%; Alpaca-7B 15.19%; GPT-3 Davinci 0.57%

Call ability correctness (given API docs)

Value: GPT-4 63.66%; GPT-3.5 59.40%; Lynx-7B 49.87%; Alpaca-7B 24.06%; GPT-3 Davinci 0.50%

Retrieve+Call correctness

Value: GPT-3.5 38.52%; GPT-4 37.04%; Lynx-7B 30.37%; Alpaca-7B 5.19%

Plan+Retrieve+Call correctness

Value: GPT-4 70.00%; GPT-3.5 22.00%; Lynx-7B 20.00%; Alpaca-7B 0.00%
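
To compare models across the three abilities, the per-ability figures above can be loaded into a small table. The sketch below copies those numbers verbatim (GPT-3 Davinci is left out because only its Call and overall scores are listed) and prints, for each ability, the best model and Lynx-7B's gain over Alpaca-7B. The dictionary layout is just one convenient choice.

```python
# Correctness (%) per ability, copied from the results listed above.
results = {
    "Call": {"GPT-4": 63.66, "GPT-3.5": 59.40, "Lynx-7B": 49.87, "Alpaca-7B": 24.06},
    "Retrieve+Call": {"GPT-3.5": 38.52, "GPT-4": 37.04, "Lynx-7B": 30.37, "Alpaca-7B": 5.19},
    "Plan+Retrieve+Call": {"GPT-4": 70.00, "GPT-3.5": 22.00, "Lynx-7B": 20.00, "Alpaca-7B": 0.00},
}

for ability, scores in results.items():
    best = max(scores, key=scores.get)                # model with the highest score
    gain = scores["Lynx-7B"] - scores["Alpaca-7B"]    # fine-tuning gain in points
    print(f"{ability}: best={best}, Lynx gain over Alpaca={gain:+.2f} pts")
```

Laid out this way, two details stand out: GPT-3.5 is marginally ahead of GPT-4 on Retrieve+Call, and the gap between GPT-4 and every other model is widest on Plan+Retrieve+Call.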

Who Should Care

What To Try In 7 Days

Run the API-Bank evaluation on your LLM to measure current API call and retrieval accuracy.

Fine-tune an instruction-tuned base model on a small slice of API-Bank training data to reduce format and missing-parameter errors.

Use the multi-agent data generator pattern to synthesize additional domain-specific API dialogues and validate with a small human review loop.

Agent Features

Memory

  • short-term dialogue history (turn-level)

Planning

  • Plan+Retrieve+Call evaluation
  • multi-step API call planning

Tool Use

  • API calling
  • API retrieval via API Search

Frameworks

  • API Search (embedding + cosine similarity)
  • Multi-agent data generator (five LLM agents)
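
The "embedding + cosine similarity" retrieval behind API Search can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the API names and descriptions are hypothetical, and a toy bag-of-words vector stands in for whatever sentence-embedding model the real system uses, which this summary does not name.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical API pool: name -> natural-language description.
API_DESCRIPTIONS = {
    "AddAlarm": "set an alarm at a given time",
    "QueryWeather": "get the weather forecast for a city",
    "BookHotel": "book a hotel room in a city for given dates",
}

def api_search(keywords: str, pool=API_DESCRIPTIONS) -> str:
    """Return the API whose description is most similar to the keywords."""
    query = embed(keywords)
    return max(pool, key=lambda name: cosine(query, embed(pool[name])))

print(api_search("weather forecast"))
```

A retrieval failure in this setting means the top-ranked API is not the one the dialogue needs, which is why keyword choice matters as much as the similarity function.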

Is Agentic

true

Architectures

  • LLaMA-7B initialization (Lynx)

Collaboration

  • Multi-agent generation: five cooperating agents

Optimization Features

Training Optimization

  • Fine-tuning Alpaca-7B on API-Bank for 3 epochs (batch 256, lr 2e-5)
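
The reported hyperparameters can be captured in a small config object. In the sketch below, only the epochs, batch size, and learning rate come from the summary; the device count and per-device batch size are assumptions chosen to show how a global batch of 256 might be reached with gradient accumulation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    # Values reported in the summary.
    base_model: str = "Alpaca-7B"
    epochs: int = 3
    learning_rate: float = 2e-5
    global_batch_size: int = 256
    # Assumed hardware split; not stated in the summary.
    num_devices: int = 8
    per_device_batch_size: int = 4

    @property
    def gradient_accumulation_steps(self) -> int:
        """Accumulation steps needed to reach the global batch size."""
        per_step = self.num_devices * self.per_device_batch_size
        if self.global_batch_size % per_step:
            raise ValueError("global batch must be divisible by devices * per-device batch")
        return self.global_batch_size // per_step

cfg = FineTuneConfig()
print(cfg.gradient_accumulation_steps)  # 8 with the assumed split
```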

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark and data are English-only.
  • Reported fine-tuning results focus on one model size (Lynx-7B); larger-model effects not shown.
  • Training data is largely automated; synthetic data may diverge from some real-world API behaviors despite human filtering.

When Not To Use

  • If you need multilingual API evaluation (API-Bank is English-only).
  • If you require end-to-end production safety certification beyond format and correctness checks.
  • If target APIs are highly proprietary and differ substantially from API-Bank's simulated APIs.

Failure Modes

  • API hallucination: model invents API names or calls not in the pool.
  • Failed API retrieval: model fails to find correct API via API Search.
  • Invalid input parameters: wrong formats or missing required fields.
  • False API call format: generated call cannot be parsed by the system.
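
Three of these failure modes can be detected mechanically from a model's output. The sketch below is a minimal illustration under stated assumptions: the bracketed call syntax, the API names, and the required-parameter sets are all hypothetical, and the parameter check only compares parameter names, not value formats. Failed API retrieval is not covered, since detecting it requires the search step as well.

```python
import re

# Hypothetical API pool: name -> required parameter names.
API_SPECS = {"AddAlarm": {"time"}, "QueryWeather": {"city", "date"}}

# Assumed call syntax for this sketch: [ApiName(key='value', ...)]
CALL_RE = re.compile(r"^\[(\w+)\((.*)\)\]$")
PARAM_RE = re.compile(r"(\w+)='([^']*)'")

def classify(output: str, specs=API_SPECS) -> str:
    """Label one generated API call with a failure mode, or 'ok'."""
    match = CALL_RE.match(output.strip())
    if match is None:
        return "false_api_call_format"     # cannot be parsed at all
    name, arg_text = match.groups()
    if name not in specs:
        return "api_hallucination"         # API is not in the pool
    params = dict(PARAM_RE.findall(arg_text))
    if set(params) != specs[name]:
        return "invalid_input_parameters"  # missing or unexpected fields
    return "ok"

for call in ["[AddAlarm(time='07:00')]", "[SendEmail(to='a@b.c')]",
             "[QueryWeather(city='Paris')]", "AddAlarm(time=07:00"]:
    print(call, "->", classify(call))
```

Checking in this order (format, then API name, then parameters) assigns each output to the earliest stage at which it breaks, which keeps the categories mutually exclusive.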

Core Entities

Models

  • Alpaca-7B
  • Lynx-7B
  • GPT-3 Davinci
  • GPT-3.5-turbo
  • GPT-4
  • ChatGLM-6B
  • LLaMA-7B

Metrics

  • Accuracy
  • ROUGE-L (response quality)
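
ROUGE-L scores a generated response by the longest common subsequence (LCS) it shares with a reference. A minimal sketch follows, assuming whitespace tokenization, lowercasing, and the balanced F1 variant; the exact ROUGE-L configuration used in the paper is not stated in this summary.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated response and a reference response."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the alarm is set for 7 am", "your alarm is set for 7 am")
print(round(score, 4))
```

Because LCS respects word order but allows gaps, a response that paraphrases the reference while keeping its key phrases in sequence still scores well.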

Datasets

  • API-Bank (training)
  • API-Bank (evaluation)
  • ToolAlpaca
  • APIBench
  • ToolBench1

Benchmarks

  • API-Bank
  • ToolAlpaca
  • APIBench
  • ToolBench1

Context Entities

Models

  • GPT-4
  • ChatGPT

Metrics

  • Accuracy
  • ROUGE-L

Datasets

  • Public APIs (examples used in prompts)

Benchmarks

  • ToolAlpaca
  • APIBench