API-Bank: a large, runnable benchmark and training set to measure and improve LLMs' API/tool use; includes Lynx, a fine-tuned model.

Overview

Decision Snapshot: Ready For Pilot

The benchmark and dataset are concrete and runnable; the empirical results and error analysis rest on a manually reviewed evaluation set and clear metrics. Limitations include an English-only focus and results reported for only one fine-tuned model.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 50%

Novelty: 70%

Authors

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li

Why It Matters For Business

API-Bank lets product and engineering teams measure how reliably models call real APIs, cheaply create high-quality tool-use training data, and reduce costly manual labeling. Improving API reliability cuts user-facing errors like failed requests or wrong side effects.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead

Summary TLDR

API-Bank is a runnable benchmark and training corpus for measuring and improving how LLMs plan, retrieve, and call APIs. It provides an evaluation system with 73 implemented APIs and 314 manually annotated dialogues, a 1,888-dialogue training set produced by a five-agent data generator, and a fine-tuned model (Lynx-7B) that improves API-call accuracy over Alpaca by ~26 points but still trails GPT-4.

Problem Statement

Current LLMs can use external tools, but there is no comprehensive, realistic benchmark or training set for measuring and improving three skills: calling known APIs, retrieving which API to call, and planning multi-step API calls.

Main Contribution

API-Bank evaluation system: 73 runnable APIs, 314 human-annotated dialogues, 753 API calls for authentic testing.

Large training corpus: 1,888 dialogues and 2,138 APIs, produced by a five-agent data generator that cuts annotation cost by ~98%.

Key Findings

API-Bank provides an executable evaluation set: 73 APIs, 314 manually reviewed dialogues, 753 API calls.

Numbers: 73 APIs; 314 dialogues; 753 API calls

Practical Use: You can run real API calls in a controlled setting to measure whether an LLM issues correct API names, parameters, and sequences.

Evidence Ref: Abstract; Section 3.1; Section 3.2
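A minimal sketch of the kind of check this enables, assuming each predicted and reference call is represented as an API name plus a parameter dictionary. The ApiCall type, the call_is_correct and call_accuracy helpers, and the API names in the example are hypothetical illustrations, not API-Bank's actual evaluation code, which may score calls differently (for example by comparing execution results).

```python
from dataclasses import dataclass, field


@dataclass
class ApiCall:
    """Hypothetical representation of one API call: a name plus parameters."""
    name: str
    params: dict = field(default_factory=dict)


def call_is_correct(predicted: ApiCall, reference: ApiCall) -> bool:
    """A call counts as correct only if both the name and all parameters match."""
    return predicted.name == reference.name and predicted.params == reference.params


def call_accuracy(predicted: list, reference: list) -> float:
    """Fraction of reference calls matched position by position.

    Missing predictions count as errors because the denominator is the
    number of reference calls.
    """
    if not reference:
        return 0.0
    hits = sum(call_is_correct(p, r) for p, r in zip(predicted, reference))
    return hits / len(reference)


# Made-up two-step dialogue: the second predicted call has a wrong parameter value.
reference_calls = [
    ApiCall("AddReminder", {"content": "team meeting", "time": "2023-03-18 10:00"}),
    ApiCall("QueryReminder", {"content": "team meeting"}),
]
predicted_calls = [
    ApiCall("AddReminder", {"content": "team meeting", "time": "2023-03-18 10:00"}),
    ApiCall("QueryReminder", {"content": "meeting"}),
]

accuracy = call_accuracy(predicted_calls, reference_calls)
print(f"call accuracy: {accuracy:.2f}")
```

Exact matching is the strictest choice; a looser scorer could normalize parameter values before comparing.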

Multi-agent generation reduced per-dialogue annotation cost from $8 to $0.1, a 98% cost saving, while producing high-quality data.

Numbers: $8 → $0.1 per dialogue; 98% cost reduction

Practical Use: Teams can cheaply scale tool-use training data generation using staged LLM agents instead of full manual annotation.

Evidence Ref: Section 4 (multi-agent description)
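The cost figure can be checked with simple arithmetic. The sketch below uses the two per-dialogue costs reported above; the exact ratio is 98.75%, which the summary reports as roughly 98%. The corpus-level totals are an illustrative extrapolation that assumes every one of the 1,888 training dialogues costs the stated per-dialogue rate; they are not figures reported by the paper.

```python
MANUAL_COST_USD = 8.0      # per-dialogue cost of full manual annotation (reported)
GENERATED_COST_USD = 0.1   # per-dialogue cost with multi-agent generation (reported)
TRAINING_DIALOGUES = 1888  # size of the API-Bank training set (reported)


def cost_saving(before: float, after: float) -> float:
    """Relative saving when the unit cost drops from `before` to `after`."""
    return (before - after) / before


saving = cost_saving(MANUAL_COST_USD, GENERATED_COST_USD)

# Extrapolated totals (assumption: uniform per-dialogue cost across the corpus).
manual_total = MANUAL_COST_USD * TRAINING_DIALOGUES
generated_total = GENERATED_COST_USD * TRAINING_DIALOGUES

print(f"relative saving: {saving:.2%}")
print(f"manual total:    ${manual_total:,.2f}")
print(f"generated total: ${generated_total:,.2f}")
```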

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall API-call correctness (zero-shot or fine-tuned, depending on the model)	GPT-4 60.24%; GPT-3.5 47.16%; Lynx-7B 39.58%; Alpaca-7B 15.19%; GPT-3 Davinci 0.57%	—	—	API-Bank evaluation (314 dialogues)	Table 3; Section 7.2	Table 3
Call ability correctness (given API docs)	GPT-4 63.66%; GPT-3.5 59.40%; Lynx-7B 49.87%; Alpaca-7B 24.06%; GPT-3 Davinci 0.50%	—	Lynx vs Alpaca +25.8 pts	Call subset (API-Bank eval)	Table 3; Section 7.2	Table 3
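The deltas between models follow directly from the scores in the table. A short sketch that recomputes them is given here; the dictionary layout and the delta helper are illustrative choices, while the numbers are copied from the two rows above.

```python
# Scores in percent, copied from the results table.
SCORES = {
    "GPT-4": {"overall": 60.24, "call": 63.66},
    "GPT-3.5": {"overall": 47.16, "call": 59.40},
    "Lynx-7B": {"overall": 39.58, "call": 49.87},
    "Alpaca-7B": {"overall": 15.19, "call": 24.06},
    "GPT-3 Davinci": {"overall": 0.57, "call": 0.50},
}


def delta(model: str, baseline: str, metric: str) -> float:
    """Difference in percentage points between a model and a baseline."""
    return round(SCORES[model][metric] - SCORES[baseline][metric], 2)


lynx_vs_alpaca_call = delta("Lynx-7B", "Alpaca-7B", "call")
lynx_vs_alpaca_overall = delta("Lynx-7B", "Alpaca-7B", "overall")
gpt4_vs_lynx_call = delta("GPT-4", "Lynx-7B", "call")

print(f"Lynx vs Alpaca (call):    {lynx_vs_alpaca_call:+.2f} pts")
print(f"Lynx vs Alpaca (overall): {lynx_vs_alpaca_overall:+.2f} pts")
print(f"GPT-4 vs Lynx (call):     {gpt4_vs_lynx_call:+.2f} pts")
```

The call-ability gap of about 25.8 points matches the delta stated in the table, and the remaining gap to GPT-4 on the same metric is about 13.8 points.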

What To Try In 7 Days

Run the API-Bank evaluation on your LLM to measure current API call and retrieval accuracy.

Fine-tune an instruction-tuned base model on a small slice of API-Bank training data to reduce format and missing-parameter errors.

Use the multi-agent data generator pattern to synthesize additional domain-specific API dialogues and validate with a small human review loop.

Agent Features

Memory

short-term dialogue history (turn-level)

Planning

Plan+Retrieve+Call evaluation; multi-step API call planning

Tool Use

API calling; API retrieval via API Search

Frameworks

API Search (embedding + cosine similarity); Multi-agent data generator (five LLM agents)
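The embedding-plus-cosine-similarity idea behind API Search can be sketched as follows. This is only an illustration under stated assumptions: the embed function is a toy bag-of-words stand-in for whatever sentence-embedding model a real system would use, and the API names and descriptions in API_POOL are made up rather than taken from API-Bank.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_api(query: str, api_pool: dict, top_k: int = 1) -> list:
    """Rank APIs by similarity between the query and each API description."""
    query_vec = embed(query)
    ranked = sorted(
        api_pool,
        key=lambda name: cosine_similarity(query_vec, embed(api_pool[name])),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical API pool: name -> natural-language description.
API_POOL = {
    "AddAlarm": "set an alarm for a given time",
    "GetWeather": "get the weather forecast for a city",
    "BookHotel": "book a hotel room in a city for given dates",
}

best = search_api("what is the weather forecast in Paris", API_POOL)
print(best)
```

With a real embedding model only embed would change; the ranking logic stays the same.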

Is Agentic

Yes

Architectures

LLaMA-7B initialization (Lynx)

Collaboration

Multi-agent generation: five cooperating agents

Optimization Features

Training Optimization

Fine-tuning Alpaca-7B on API-Bank for 3 epochs (batch 256, lr 2e-5)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank

Data URLs

https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank

Risks & Boundaries

Limitations

Benchmark and data are English-only.

Reported fine-tuning results cover only one model size (Lynx-7B); effects on larger models are not shown.

When Not To Use

If you need multilingual API evaluation (API-Bank is English-only).

If you require end-to-end production safety certification beyond format and correctness checks.

Failure Modes

API hallucination: model invents API names or calls not in the pool.

Failed API retrieval: model fails to find correct API via API Search.
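The two failure modes above can be separated mechanically when the API pool and the search results are logged. The sketch below is one possible bucketing, not the paper's error taxonomy: the classify_failure helper, the label strings, and the example API names are all hypothetical.

```python
from collections import Counter


def classify_failure(predicted: str, reference: str, api_pool: set, retrieved: list) -> str:
    """Assign one predicted API name to a coarse outcome bucket.

    predicted: API name the model called
    reference: API name the annotation expects
    api_pool:  every API name that actually exists
    retrieved: API names returned by the search step for this turn
    """
    if predicted == reference:
        return "ok"
    if predicted not in api_pool:
        # The model invented an API that is not in the pool.
        return "api_hallucination"
    if reference not in retrieved:
        # The search step never surfaced the correct API.
        return "failed_api_retrieval"
    return "other_error"


# Made-up cases: (predicted, reference, retrieved).
POOL = {"GetWeather", "AddAlarm", "BookHotel"}
cases = [
    ("GetWeather", "GetWeather", ["GetWeather"]),
    ("FetchForecast", "GetWeather", ["GetWeather"]),
    ("AddAlarm", "GetWeather", ["AddAlarm"]),
    ("BookHotel", "GetWeather", ["GetWeather", "BookHotel"]),
]

tally = Counter(classify_failure(p, r, POOL, found) for p, r, found in cases)
print(dict(tally))
```

Checking hallucination before retrieval keeps the buckets mutually exclusive: an invented name is counted once, even if retrieval also failed on that turn.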

Core Entities

Models

Alpaca-7B, Lynx-7B, GPT-3 Davinci, GPT-3.5-turbo, GPT-4, ChatGLM-6B, LLaMA-7B

Metrics

Accuracy, ROUGE-L (response quality)

Datasets

API-Bank (training), API-Bank (evaluation), ToolAlpaca, APIBench, ToolBench1

Benchmarks

API-Bank, ToolAlpaca, APIBench, ToolBench1

Context Entities

Models

GPT-4, ChatGPT

Metrics

Accuracy, ROUGE-L

Datasets

Public APIs (examples used in prompts)

Benchmarks

ToolAlpaca, APIBench
