Overview
Production Readiness
0.5
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
12
Why It Matters For Business
API-Bank lets product and engineering teams measure how reliably models call real APIs, cheaply create high-quality tool-use training data, and reduce costly manual labeling. Improving API reliability cuts user-facing errors like failed requests or wrong side effects.
Summary TLDR
API-Bank is a runnable benchmark and training corpus for measuring and improving how LLMs plan, retrieve, and call APIs. It provides an evaluation system with 73 implemented APIs and 314 manually annotated dialogues, a 1,888-dialogue training set produced by a five-agent data generator, and a fine-tuned model (Lynx-7B) that improves API-call accuracy over Alpaca by ~26 points but still trails GPT-4.
Problem Statement
Current LLMs can use external tools, but the field lacks a comprehensive, realistic benchmark and training data for measuring and improving three skills: calling known APIs, retrieving the right API to call, and planning multi-step API calls.
Main Contribution
API-Bank evaluation system: 73 runnable APIs, 314 human-annotated dialogues, 753 API calls for authentic testing.
Large training corpus: 1,888 dialogues covering 2,138 APIs, produced with a five-agent LLM data generator that cuts annotation cost by ~98%.
Lynx, a model fine-tuned from Alpaca-7B on API-Bank, plus an open benchmark for comparing LLMs on Call, Retrieve+Call, and Plan+Retrieve+Call abilities, with an error analysis of the main failure modes.
Key Findings
API-Bank provides an executable evaluation set: 73 APIs, 314 manually annotated dialogues, 753 API calls.
Multi-agent generation reduced per-dialogue annotation cost from $8 to $0.10, a cost saving of over 98%, while producing high-quality data.
Fine-tuning on API-Bank lifts Alpaca-7B's API call correctness by ~26 percentage points (24.06% → 49.87%), bringing it closer to GPT-3.5 but still ~21 points behind GPT-4 on overall correctness.
Base GPT-3 Davinci shows almost no API-usage ability on this benchmark (overall correctness ~0.57%).
Main failure modes across models are retrieval failure, API hallucination, wrong/invalid input parameters, and unparseable API call formats.
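The cost and accuracy figures above can be re-derived from the quoted numbers. The short check below uses only values stated in this summary; the helper names are illustrative.

```python
# Sanity-check the headline figures quoted in the findings above.

def relative_saving(before: float, after: float) -> float:
    """Fractional cost reduction when moving from `before` to `after`."""
    return (before - after) / before


def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute gain in percentage points between two percentages."""
    return after_pct - before_pct


cost_saving = relative_saving(8.0, 0.1)    # $8 -> $0.10 per dialogue
accuracy_gain = point_gain(24.06, 49.87)   # Alpaca-7B -> Lynx-7B correctness

print(f"annotation cost saving: {cost_saving:.2%}")
print(f"API-call correctness gain: {accuracy_gain:.2f} points")
```

The saving works out to 98.75%, which the summary reports as roughly 98%, and the accuracy gain to 25.81 points, reported as ~26.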
Results
Overall API-call correctness (zero-shot or fine-tuned, depending on the model)
Call ability correctness (given API docs)
Retrieve+Call correctness
Plan+Retrieve+Call correctness
Who Should Care
What To Try In 7 Days
Run the API-Bank evaluation on your LLM to measure current API call and retrieval accuracy.
Fine-tune an instruction-tuned base model on a small slice of API-Bank training data to reduce format and missing-parameter errors.
Use the multi-agent data generator pattern to synthesize additional domain-specific API dialogues and validate with a small human review loop.
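One way to prototype the multi-agent generator pattern is a chain of stages that each enrich a shared state, with a final checking stage that rejects inconsistent samples before human review. The sketch below is an assumption-heavy illustration: this summary only states that five LLM agents cooperate, so the stage names, the state fields, and the stubbed outputs are all hypothetical, and each stub stands in for what would be an LLM prompt in practice.

```python
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]
Agent = Callable[[State], State]


def run_pipeline(agents: List[Agent], seed: State) -> Optional[State]:
    """Pass a state dict through each agent in order; drop it if rejected."""
    state = dict(seed)
    for agent in agents:
        state = agent(state)
        if state.get("rejected"):
            return None
    return state


# Hypothetical stub agents (not the paper's actual agent roles).
def pick_domain(s: State) -> State:
    return {**s, "domain": "calendar"}


def draft_api(s: State) -> State:
    return {**s, "api": {"name": "AddEvent", "required": ["title", "date"]}}


def write_query(s: State) -> State:
    return {**s, "query": "Book a dentist visit on 2024-05-01"}


def write_call(s: State) -> State:
    call = {"name": "AddEvent",
            "args": {"title": "dentist visit", "date": "2024-05-01"}}
    return {**s, "call": call}


def check_dialogue(s: State) -> State:
    """Reject samples whose call does not match the drafted API."""
    api, call = s["api"], s["call"]
    ok = (call["name"] == api["name"]
          and all(k in call["args"] for k in api["required"]))
    return {**s, "rejected": not ok}


AGENTS: List[Agent] = [pick_domain, draft_api, write_query, write_call, check_dialogue]
sample = run_pipeline(AGENTS, {"id": 1})  # None if the checker rejected it
```

Only samples that survive the automated check would then go to the small human review loop, which keeps reviewer time focused on plausible dialogues.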
Agent Features
Memory
- short-term dialogue history (turn-level)
Planning
- Plan+Retrieve+Call evaluation
- multi-step API call planning
Tool Use
- API calling
- API retrieval via API Search
Frameworks
- API Search (embedding + cosine similarity)
- Multi-agent data generator (five LLM agents)
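API Search is listed above as embedding plus cosine similarity. A minimal sketch of that retrieval step follows, assuming a toy bag-of-words vector in place of a real sentence encoder; the `embed` function, the API names, and their descriptions are all hypothetical and only illustrate the ranking logic.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query: str, api_docs: Dict[str, str], top_k: int = 1) -> List[Tuple[str, float]]:
    """Rank API descriptions by similarity to the query, best first."""
    q = embed(query)
    scored = [(name, cosine(q, embed(doc))) for name, doc in api_docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical API pool: name -> one-line description.
API_DOCS = {
    "AddAlarm": "set an alarm at a given time",
    "QueryStock": "look up the stock price of a company",
    "BookHotel": "book a hotel room for given dates",
}

best = search("what is the stock price of acme", API_DOCS)
```

With a real encoder the same ranking code applies unchanged; only `embed` and the vector type would differ.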
Is Agentic
true
Architectures
- Lynx: LLaMA-7B architecture, fine-tuned from Alpaca-7B
Collaboration
- Multi-agent generation: five cooperating agents
Optimization Features
Training Optimization
- Fine-tuning Alpaca-7B on API-Bank for 3 epochs (batch 256, lr 2e-5)
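The hyperparameters above can be captured in a small config object. In this sketch the epochs, batch size, and learning rate come from this summary, while the device count and per-device batch size are assumptions added only to show how a global batch of 256 might be reached through gradient accumulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    # Values stated in the summary.
    base_model: str = "Alpaca-7B"
    epochs: int = 3
    global_batch_size: int = 256
    learning_rate: float = 2e-5
    # Assumptions: the hardware layout is not stated in the source.
    num_devices: int = 8
    per_device_batch_size: int = 4

    @property
    def grad_accum_steps(self) -> int:
        """Accumulation steps needed to reach the global batch size."""
        per_step = self.num_devices * self.per_device_batch_size
        if self.global_batch_size % per_step:
            raise ValueError("global batch must be divisible by devices * per-device batch")
        return self.global_batch_size // per_step


cfg = FinetuneConfig()
print(cfg.grad_accum_steps)  # 256 / (8 * 4)
```

Changing `num_devices` or `per_device_batch_size` keeps the effective batch at 256 as long as the product divides it evenly.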
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmark and data are English-only.
- Reported fine-tuning results focus on one model size (Lynx-7B); larger-model effects not shown.
- Training data is largely machine-generated; synthetic dialogues may diverge from some real-world API behaviors despite human filtering.
When Not To Use
- If you need multilingual API evaluation (API-Bank is English-only).
- If you require end-to-end production safety certification beyond format and correctness checks.
- If target APIs are highly proprietary and differ substantially from API-Bank's simulated APIs.
Failure Modes
- API hallucination: model invents API names or calls not in the pool.
- Failed API retrieval: model fails to find correct API via API Search.
- Invalid input parameters: wrong formats or missing required fields.
- False API call format: generated call cannot be parsed by the system.
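Three of the failure modes above can be detected mechanically from a generated call string. The sketch below assumes calls look like `ApiName(key='value', ...)` and that each API declares a list of required parameters; both the call syntax and the API pool are assumptions for illustration, not the benchmark's actual checker. Retrieval failure is not covered, since detecting it requires comparing the retrieved API against the expected one.

```python
import ast
from typing import Dict, List

# Hypothetical API pool: name -> required parameter names.
API_POOL: Dict[str, List[str]] = {
    "AddAlarm": ["time"],
    "QueryStock": ["stock_code", "date"],
}


def classify_call(raw: str, pool: Dict[str, List[str]] = API_POOL) -> str:
    """Return 'ok' or the first failure mode that applies to a generated call."""
    try:
        node = ast.parse(raw.strip(), mode="eval").body
    except (SyntaxError, ValueError):
        return "false_format"            # cannot be parsed at all
    is_named_call = isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    if not is_named_call or node.args:
        return "false_format"            # not a keyword-only call
    if node.func.id not in pool:
        return "api_hallucination"       # API name is not in the pool
    given = {kw.arg for kw in node.keywords}
    if any(req not in given for req in pool[node.func.id]):
        return "invalid_parameters"      # a required field is missing
    return "ok"


print(classify_call("AddAlarm(time='07:00')"))
print(classify_call("SetReminder(time='07:00')"))
```

This check only covers missing required fields; validating parameter value formats would need per-API type or pattern rules.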
Core Entities
Models
- Alpaca-7B
- Lynx-7B
- GPT-3 Davinci
- GPT-3.5-turbo
- GPT-4
- ChatGLM-6B
- LLaMA-7B
Metrics
- Accuracy
- ROUGE-L (response quality)
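Both metrics can be approximated in a few lines. The sketch below is a simplification under stated assumptions: accuracy is computed as exact string match between predicted and gold calls (this summary does not say how calls are matched), and ROUGE-L is the balanced F1 over the longest common subsequence of whitespace tokens, which may differ from official implementations in tokenization and weighting.

```python
from typing import List, Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold API call."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Balanced ROUGE-L F1 over lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


call_acc = accuracy(["A(x='1')", "B()"], ["A(x='1')", "C()"])
response_score = rouge_l_f1("alarm set", "the alarm is set")
```

Accuracy scores the API calls themselves, while ROUGE-L scores the natural-language response that follows a call, so the two are reported separately.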
Datasets
- API-Bank (training)
- API-Bank (evaluation)
- ToolAlpaca
- APIBench
- ToolBench1
Benchmarks
- API-Bank
- ToolAlpaca
- APIBench
- ToolBench1
Context Entities
Models
- GPT-4
- ChatGPT
Metrics
- Accuracy
- ROUGE-L
Datasets
- Public APIs (examples used in prompts)
Benchmarks
- ToolAlpaca
- APIBench

