Overview
The benchmark and dataset are concrete and runnable; the empirical results and error analysis rest on a manually reviewed evaluation set and clearly defined metrics. Limitations include an English-only focus and results reported for only a single fine-tuned model.
Citations: 12
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 50%
Novelty: 70%
Why It Matters For Business
API-Bank lets product and engineering teams measure how reliably models call real APIs, cheaply create high-quality tool-use training data, and reduce costly manual labeling. Improving API reliability cuts user-facing errors like failed requests or wrong side effects.
Who Should Care
Summary TLDR
API-Bank is a runnable benchmark and training corpus for measuring and improving how LLMs plan, retrieve, and call APIs. It provides an evaluation system with 73 implemented APIs and 314 manually annotated dialogues, a 1,888-dialogue training set produced by a five-agent data generator, and a fine-tuned model (Lynx-7B) that improves API-call accuracy over Alpaca by ~26 points but still trails GPT-4.
Problem Statement
Current LLMs can use external tools, but there is no comprehensive, realistic benchmark or training dataset for measuring and improving three skills: calling known APIs, retrieving which API to call, and planning multi-step API calls.
Main Contribution
API-Bank evaluation system: 73 runnable APIs, 314 human-annotated dialogues, 753 API calls for authentic testing.
Large training corpus: 1,888 dialogues covering 2,138 APIs, produced with a five-agent data generator that cuts annotation cost by ~98%.
Key Findings
API-Bank provides an executable evaluation set: 73 APIs, 314 manually reviewed dialogues, 753 API calls.
Multi-agent generation reduced per-dialogue annotation cost from $8 to $0.10, a cost saving of roughly 98%, while producing high-quality data.
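The cost-saving figure follows directly from the two per-dialogue costs quoted above. As a minimal worked check (the function name `cost_saving` is illustrative and not part of API-Bank), the exact ratio comes out at 98.75%, which the summary rounds to about 98%:

```python
def cost_saving(before: float, after: float) -> float:
    """Return the fractional saving when unit cost drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# Per-dialogue annotation costs in USD, as quoted in the finding above.
saving = cost_saving(before=8.0, after=0.1)
print(f"Relative saving: {saving:.2%}")
```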
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall API-call correctness (zero-shot or fine-tuned, depending on the model) | GPT-4 60.24%; GPT-3.5 47.16%; Lynx-7B 39.58%; Alpaca-7B 15.19%; GPT-3 Davinci 0.57% | — | — | API-Bank evaluation (314 dialogues) | Table 3; Section 7.2 | Table 3 |
| Call ability correctness (given API docs) | GPT-4 63.66%; GPT-3.5 59.40%; Lynx-7B 49.87%; Alpaca-7B 24.06%; GPT-3 Davinci 0.50% | — | Lynx vs Alpaca +25.8 pts | Call subset (API-Bank eval) | Table 3; Section 7.2 | Table 3 |
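The deltas implied by the table can be recomputed from the listed scores. The sketch below only does arithmetic on the values shown; the dictionary names and the helper `delta_points` are illustrative, and any delta not printed in the table is derived here rather than reported by the source.

```python
# Scores in percent, copied from the two table rows above.
overall_scores = {
    "GPT-4": 60.24, "GPT-3.5": 47.16, "Lynx-7B": 39.58,
    "Alpaca-7B": 15.19, "GPT-3 Davinci": 0.57,
}
call_scores = {
    "GPT-4": 63.66, "GPT-3.5": 59.40, "Lynx-7B": 49.87,
    "Alpaca-7B": 24.06, "GPT-3 Davinci": 0.50,
}


def delta_points(scores: dict[str, float], model: str, baseline: str) -> float:
    """Percentage-point gap between `model` and `baseline`."""
    return round(scores[model] - scores[baseline], 2)


for label, scores in (("overall", overall_scores), ("call", call_scores)):
    gain = delta_points(scores, "Lynx-7B", "Alpaca-7B")
    gap = delta_points(scores, "GPT-4", "Lynx-7B")
    print(f"{label}: Lynx vs Alpaca {gain:+.2f} pts, GPT-4 vs Lynx {gap:+.2f} pts")
```

On the call subset this reproduces the +25.8 point Lynx-over-Alpaca gain listed in the Delta column.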
What To Try In 7 Days
Run the API-Bank evaluation on your LLM to measure current API call and retrieval accuracy.
Fine-tune an instruction-tuned base model on a small slice of API-Bank training data to reduce format and missing-parameter errors.
Use the multi-agent data generator pattern to synthesize additional domain-specific API dialogues and validate with a small human review loop.
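For the first step, a rough picture of what scoring API calls can look like is sketched below. This is a simplification under stated assumptions, not the API-Bank evaluator: it assumes calls are written as `[ApiName(key='value')]`, it compares name and arguments by exact match rather than by executing the APIs, and `GetWeather`, `parse_call`, and `call_accuracy` are hypothetical names chosen for illustration.

```python
import ast
from typing import Any, Optional


def parse_call(text: str) -> Optional[tuple[str, dict[str, Any]]]:
    """Parse "[ApiName(key='value')]" into (name, kwargs); None if malformed."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    try:
        node = ast.parse(text, mode="eval").body
    except SyntaxError:
        return None
    # Accept only a plain call with keyword arguments, e.g. Name(k=v, ...).
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return None
    if node.args or any(kw.arg is None for kw in node.keywords):
        return None
    try:
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except ValueError:
        return None  # an argument value was not a literal
    return node.func.id, kwargs


def call_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose parsed call equals the reference call."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not predictions:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        parsed = parse_call(pred)
        if parsed is not None and parsed == parse_call(ref):
            correct += 1
    return correct / len(predictions)


# Hypothetical model outputs scored against one reference call.
reference = "[GetWeather(city='Paris', days=2)]"
predictions = [
    "[GetWeather(city='Paris', days=2)]",  # exact match
    "[GetWeather(days=2, city='Paris')]",  # same call, different argument order
    "[GetWeather(city='Paris')]",          # missing parameter
    "Sure, I will check the weather.",     # format error: no call emitted
]
print(call_accuracy(predictions, [reference] * len(predictions)))
```

Because arguments are compared as dictionaries, argument order does not matter, while missing parameters and malformed output both count as errors, the two error types the fine-tuning step above aims to reduce.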
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Training Optimization
Reproducibility
Risks & Boundaries
Limitations
Benchmark and data are English-only.
Reported fine-tuning results cover only one model size (Lynx-7B); effects on larger models are not shown.
When Not To Use
You need multilingual API evaluation (API-Bank is English-only).
You require end-to-end production safety certification beyond format and correctness checks.
Failure Modes
API hallucination: model invents API names or calls not in the pool.
Failed API retrieval: model fails to find correct API via API Search.
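The two failure modes above can be told apart mechanically once the API pool and the search results are known. The sketch below is one hypothetical way to label them; the function `label_failure`, the label strings, and the example API names are all assumptions made for illustration and do not come from API-Bank.

```python
from collections import Counter


def label_failure(called: str, expected: str,
                  pool: set[str], retrieved: set[str]) -> str:
    """Assign a coarse failure label to a single predicted API call."""
    if called not in pool:
        return "api_hallucination"      # model invented an API name
    if expected not in retrieved:
        return "failed_api_retrieval"   # search never surfaced the right API
    if called != expected:
        return "wrong_api"              # right API was available but not chosen
    return "ok"


# Hypothetical pool and cases: (called, expected, APIs returned by search).
pool = {"GetWeather", "AddReminder", "SearchFlights"}
cases = [
    ("GetWeather", "GetWeather", {"GetWeather"}),
    ("GetForecast", "GetWeather", {"GetWeather"}),
    ("AddReminder", "SearchFlights", {"AddReminder"}),
    ("AddReminder", "GetWeather", {"GetWeather", "AddReminder"}),
]
summary = Counter(label_failure(c, e, pool, r) for c, e, r in cases)
print(dict(summary))
```

Checking hallucination first is a deliberate choice: a name outside the pool is an error regardless of what the search returned.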

