API-Bank: a large, runnable benchmark and training set to measure and improve LLMs' API/tool use; includes Lynx, a fine-tuned model.

April 14, 2023 (8 min)

Overview

Decision Snapshot: Ready For Pilot

The benchmark and dataset are concrete and runnable; the empirical results and error analysis rest on a manually reviewed evaluation set and clear metrics. Limitations include an English-only focus and results reported for a single fine-tuned model.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 50%

Novelty: 70%

Authors

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

API-Bank lets product and engineering teams measure how reliably models call real APIs, cheaply create high-quality tool-use training data, and reduce costly manual labeling. Improving API reliability cuts user-facing errors like failed requests or wrong side effects.

Who Should Care

Summary TLDR

API-Bank is a runnable benchmark and training corpus for measuring and improving how LLMs plan, retrieve, and call APIs. It provides an evaluation system with 73 implemented APIs and 314 manually annotated dialogues, a 1,888-dialogue training set produced by a five-agent data generator, and a fine-tuned model (Lynx-7B) that improves API-call accuracy over Alpaca by ~26 points but still trails GPT-4.

Problem Statement

Current LLMs can use external tools, but there is no comprehensive, realistic benchmark or training set for measuring and improving three skills: calling known APIs, retrieving which API to call, and planning multi-step API calls.

Main Contribution

API-Bank evaluation system: 73 runnable APIs, 314 human-annotated dialogues, 753 API calls for authentic testing.

Large training corpus: 1,888 dialogues covering 2,138 APIs, produced with a generator built from five cooperating LLM agents that cuts annotation cost by ~98%.

Key Findings

API-Bank provides an executable evaluation set: 73 APIs, 314 manually reviewed dialogues, 753 API calls.

Numbers: 73 APIs; 314 dialogues; 753 API calls

Practical Use: You can run real API calls in a controlled setting to measure if an LLM issues correct API names, parameters, and sequences.

Evidence Ref: Abstract; Section 3.1; Section 3.2
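A minimal sketch of how such a check might be scored, assuming each API call is represented as a name plus a parameter dictionary and compared position by position. The `ApiCall` type, the `call_accuracy` helper, and the sample API names are hypothetical and are not API-Bank's actual evaluation code.

```python
from dataclasses import dataclass, field


@dataclass
class ApiCall:
    """Hypothetical representation of one API call: a name plus keyword parameters."""
    name: str
    params: dict = field(default_factory=dict)


def call_accuracy(predicted, expected):
    """Fraction of expected calls reproduced exactly (name and parameters), in order."""
    if not expected:
        return 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Made-up ground truth and model output for illustration only.
expected = [
    ApiCall("QueryBalance", {"user": "alice"}),
    ApiCall("Transfer", {"src": "alice", "dst": "bob", "amount": 20}),
]
predicted = [
    ApiCall("QueryBalance", {"user": "alice"}),
    ApiCall("Transfer", {"src": "alice", "dst": "bob"}),  # "amount" is missing
]

score = call_accuracy(predicted, expected)
print(score)
```

Exact matching is the strictest choice; a runnable benchmark could instead compare the results of executing each call, which tolerates harmless differences in how parameters are written.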

Multi-agent generation reduced per-dialogue annotation cost from $8 to $0.1, a 98% cost saving, while producing high-quality data.

Numbers: $8 → $0.1 per dialogue; 98% cost reduction

Practical Use: Teams can cheaply scale tool-use training data generation using staged LLM agents instead of full manual annotation.

Evidence Ref: Section 4; Section 4 Multi-agent description
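The cost figures can be checked with a few lines of arithmetic. The per-dialogue costs and the 1,888-dialogue training-set size come from this summary; the corpus-level totals are an extrapolation that assumes the same per-dialogue cost applies uniformly, which the summary does not state.

```python
MANUAL_COST = 8.0           # USD per dialogue, manual annotation (from the finding above)
GENERATED_COST = 0.1        # USD per dialogue, multi-agent generation (from the finding above)
TRAINING_DIALOGUES = 1888   # training-set size reported in this summary

# Relative saving per dialogue.
saving = (MANUAL_COST - GENERATED_COST) / MANUAL_COST

# Assumption: the per-dialogue cost scales linearly to the whole training set.
manual_total = MANUAL_COST * TRAINING_DIALOGUES
generated_total = GENERATED_COST * TRAINING_DIALOGUES

print(f"relative saving: {saving:.2%}")
print(f"manual total: ${manual_total:,.2f}")
print(f"generated total: ${generated_total:,.2f}")
```

The exact ratio is slightly above the reported 98%, so the headline figure appears to be rounded down.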

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall API-call correctness (zero-shot / fine-tuned depends on model) | GPT-4 60.24%; GPT-3.5 47.16%; Lynx-7B 39.58%; Alpaca-7B 15.19%; GPT-3 Davinci 0.57% | - | - | API-Bank evaluation (314 dialogues) | Table 3; Section 7.2 | Table 3
Call ability correctness (given API docs) | GPT-4 63.66%; GPT-3.5 59.40%; Lynx-7B 49.87%; Alpaca-7B 24.06%; GPT-3 Davinci 0.50% | - | Lynx vs Alpaca +25.8 pts | Call subset (API-Bank eval) | Table 3; Section 7.2 | Table 3
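The reported delta can be recomputed from the listed scores, and the same subtraction gives the gaps that the results do not state explicitly. The scores below are copied from the results above; the `delta` helper is only an illustration.

```python
# Scores in percent, copied from the results above.
overall = {
    "GPT-4": 60.24, "GPT-3.5": 47.16, "Lynx-7B": 39.58,
    "Alpaca-7B": 15.19, "GPT-3 Davinci": 0.57,
}
call = {
    "GPT-4": 63.66, "GPT-3.5": 59.40, "Lynx-7B": 49.87,
    "Alpaca-7B": 24.06, "GPT-3 Davinci": 0.50,
}


def delta(scores, model, baseline):
    """Percentage-point difference between a model and a baseline."""
    return round(scores[model] - scores[baseline], 2)


lynx_vs_alpaca_call = delta(call, "Lynx-7B", "Alpaca-7B")
lynx_vs_alpaca_overall = delta(overall, "Lynx-7B", "Alpaca-7B")
gpt4_vs_lynx_overall = delta(overall, "GPT-4", "Lynx-7B")

print(lynx_vs_alpaca_call, lynx_vs_alpaca_overall, gpt4_vs_lynx_overall)
```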

What To Try In 7 Days

Run the API-Bank evaluation on your LLM to measure current API call and retrieval accuracy.

Fine-tune an instruction-tuned base model on a small slice of API-Bank training data to reduce format and missing-parameter errors.

Use the multi-agent data generator pattern to synthesize additional domain-specific API dialogues and validate with a small human review loop.
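The staged-generation idea in the last item might be organized as a simple pipeline in which each agent is a prompt applied to the previous agent's output, followed by a human spot check. Everything below is an assumption made for illustration: the stage prompts, the `fake_llm` stub, and the review sampling rate are hypothetical and are not taken from API-Bank's generator.

```python
import random
from typing import Callable, List

# Hypothetical stage prompts; this summary states only that five agents cooperate.
STAGES = [
    "Propose a domain: {x}",
    "Design an API for: {x}",
    "Write a user query for: {x}",
    "Write the API call and response for: {x}",
    "Check the dialogue and reply PASS or FAIL: {x}",
]


def run_pipeline(seed: str, llm: Callable[[str], str]) -> List[str]:
    """Feed each stage's output into the next stage and keep every intermediate output."""
    outputs, current = [], seed
    for template in STAGES:
        current = llm(template.format(x=current))
        outputs.append(current)
    return outputs


def sample_for_review(dialogues: List[List[str]], rate: float, rng: random.Random):
    """Pick a random subset of generated dialogues for human review."""
    k = max(1, round(len(dialogues) * rate))
    return rng.sample(dialogues, k)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"<{prompt[:24]}>"


dialogues = [run_pipeline(f"seed {i}", fake_llm) for i in range(10)]
to_review = sample_for_review(dialogues, rate=0.2, rng=random.Random(0))
print(len(dialogues), len(to_review))
```

Passing the model in as a callable keeps the pipeline testable without network access; swapping `fake_llm` for a real client is the only change needed to try it against an actual model.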

Agent Features

Memory: short-term dialogue history (turn-level)
Planning: Plan+Retrieve+Call evaluation; multi-step API call planning
Tool Use: API calling; API retrieval via API Search
Frameworks: API Search (embedding + cosine similarity); Multi-agent data generator (five LLM agents)
Is Agentic: Yes
Architectures: LLaMA-7B initialization (Lynx)
Collaboration: Multi-agent generation: five cooperating agents
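The "embedding + cosine similarity" retrieval listed under Frameworks can be sketched as follows. This is an illustration under stated assumptions, not API-Bank's implementation: the bag-of-words `embed` function stands in for a real sentence-embedding model, and the API names and descriptions in `API_POOL` are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a sentence-embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical API pool: name -> natural-language description.
API_POOL = {
    "GetWeather": "get the weather forecast for a city",
    "BookHotel": "book a hotel room for given dates",
    "AddReminder": "add a reminder at a given time",
}


def api_search(query, pool):
    """Return the API whose description is most similar to the query."""
    q = embed(query)
    return max(pool, key=lambda name: cosine(q, embed(pool[name])))


best = api_search("what is the weather forecast in Paris", API_POOL)
print(best)
```

With a learned embedding model only `embed` changes; the ranking step stays the same, which is why retrieval failures in this setup usually trace back to the embedding or to the wording of the API descriptions.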

Optimization Features

Training Optimization
Fine-tuning Alpaca-7B on API-Bank for 3 epochs (batch 256, lr 2e-5)
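The reported hyperparameters can be captured in a small configuration object. The base model, epochs, batch size, and learning rate come from this summary; the GPU count, per-device batch size, and the choice to treat each of the 1,888 dialogues as one training example are assumptions made only so the arithmetic can run.

```python
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    base_model: str = "Alpaca-7B"    # from this summary
    epochs: int = 3                  # from this summary
    global_batch_size: int = 256     # from this summary
    learning_rate: float = 2e-5      # from this summary
    num_gpus: int = 8                # assumption: not stated in this summary
    per_device_batch_size: int = 4   # assumption: not stated in this summary

    def grad_accum_steps(self) -> int:
        """Accumulation steps needed to reach the global batch size."""
        per_step = self.num_gpus * self.per_device_batch_size
        if self.global_batch_size % per_step:
            raise ValueError("global batch must be divisible by num_gpus * per_device_batch_size")
        return self.global_batch_size // per_step

    def optimizer_steps(self, num_examples: int) -> int:
        """Total optimizer updates over all epochs."""
        steps_per_epoch = -(-num_examples // self.global_batch_size)  # ceiling division
        return steps_per_epoch * self.epochs


cfg = FineTuneConfig()
accum = cfg.grad_accum_steps()
# Assumption: one training example per dialogue; the real example count may differ.
total_steps = cfg.optimizer_steps(1888)
print(accum, total_steps)
```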

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmark and data are English-only.

Reported fine-tuning results focus on one model size (Lynx-7B); larger-model effects not shown.

When Not To Use

If you need multilingual API evaluation (API-Bank is English-only).

If you require end-to-end production safety certification beyond format and correctness checks.

Failure Modes

API hallucination: model invents API names or calls not in the pool.

Failed API retrieval: model fails to find correct API via API Search.
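API hallucination in particular is cheap to detect before a call is executed, because it only requires checking the predicted name against the known pool. The sketch below assumes a hypothetical pool that maps each API name to its required parameter names; the labels and the `classify_call` helper are illustrative and do not reproduce the paper's error taxonomy.

```python
# Hypothetical pool: API name -> set of required parameter names.
API_POOL = {
    "GetWeather": {"city"},
    "Transfer": {"src", "dst", "amount"},
}


def classify_call(name, params, pool):
    """Return a coarse label for one predicted API call."""
    if name not in pool:
        return "api_hallucination"      # the model invented an API
    missing = pool[name] - set(params)  # required parameters the model left out
    if missing:
        return "missing_parameter"
    return "ok"


labels = [
    classify_call("GetWeather", {"city": "Paris"}, API_POOL),
    classify_call("GetForecast", {"city": "Paris"}, API_POOL),
    classify_call("Transfer", {"src": "alice", "dst": "bob"}, API_POOL),
]
print(labels)
```

Retrieval failures need a different check, since the predicted API exists in the pool but is not the one the dialogue requires; that comparison needs the annotated ground truth rather than the pool alone.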

Core Entities

Models

Alpaca-7B, Lynx-7B, GPT-3 Davinci, GPT-3.5-turbo, GPT-4, ChatGLM-6B, LLaMA-7B

Metrics

Accuracy, ROUGE-L (response quality)
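ROUGE-L scores a response by the longest common subsequence (LCS) it shares with a reference. The sketch below uses whitespace tokenization and the balanced F1 form; both are simplifying assumptions, and the exact variant and tokenizer used in the paper are not given in this summary.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercase whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up response and reference for illustration.
score = rouge_l_f1("your meeting is booked", "the meeting is booked for monday")
print(round(score, 2))
```

Here the shared subsequence is "meeting is booked", so precision is 3/4 and recall is 3/6.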

Datasets

API-Bank (training), API-Bank (evaluation), ToolAlpaca, APIBench, ToolBench1

Benchmarks

API-Bank, ToolAlpaca, APIBench, ToolBench1

Context Entities

Models

GPT-4, ChatGPT

Metrics

Accuracy, ROUGE-L

Datasets

Public APIs (examples used in prompts)

Benchmarks

ToolAlpaca, APIBench