ToolACE: auto-generates 26k verified APIs and complex dialogs to teach LLMs reliable function calling

September 2, 20247 min

Overview

Decision SnapshotNeeds Validation

The paper provides benchmark wins, ablations, and clear modules; evidence is strong for function-calling tasks on BFCL and API-Bank but limited to provided benchmarks and model sizes.

Citations2

Evidence Strength0.80

Confidence0.80

Risk Signals11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ToolACE lets mid-size LLMs (8B) learn practical API use by supplying large, diverse, and verified synthetic tool data—reducing reliance on proprietary APIs and enabling in-house fine-tuned agents for automation tasks.

Who Should Care

Summary TLDR

ToolACE is an automated pipeline that synthesizes diverse, complex, and verified function-calling data for LLMs. It builds a 26,507-API pool via a self-evolution synthesis step, generates multi-agent dialogs guided by a loss-based complexity evaluator, and filters data via a rule checker plus model-based checks. Fine-tuning an 8B model (ToolACE-8B) with this data yields state-of-the-art function-calling performance on BFCL and API-Bank benchmarks, matching many API-backed systems on evaluated tasks. A subset of the model and data is publicly released.

Problem Statement

Real function-calling requires diverse, accurate, and complex examples, but collecting labeled real API dialogs is costly and public pipelines produce shallow, narrow samples. LLMs need training data matched to their capability and validated for executability. ToolACE aims to auto-synthesize large-scale, capability-aware, and verified function-calling data to improve zero-shot and fine-tuned tool use.

Main Contribution

A three-stage automated pipeline (Tool Self-Evolution Synthesis, Self-Guided Dialog Generation, Dual-Layer Verification) to create large-scale function-calling data.

A loss-based complexity evaluator that uses the target LLM to steer dialog difficulty so training samples fit model capacity.

Key Findings

ToolACE builds a very large synthetic API pool.

Numbers26,507 APIs across 390 domains

Practical UseTrain models on far broader API coverage to improve zero-shot and cross-domain function-calling.

Evidence RefTable 1; Abstract

An 8B model fine-tuned on ToolACE achieves top leaderboard performance for function calling.

NumbersBFCL overall ≈59.22; Non-live AST 89.27; Non-live Exec 90.07

Practical UseYou can reach near state-of-the-art function-calling with an 8B model plus targeted synthetic data instead of only larger proprietary models.

Evidence RefTable 2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
BFCL overall (ToolACE-8B)59.22 (rank 3 on leaderboard)GPT-4-turbo (59.49)≈-0.27 vs topBFCL-v3Table 2 shows ToolACE-8B at 59.22 overall and rank 3Table 2
BFCL Non-live AST (ToolACE-8B)89.27GPT-4-turbo 82.65+6.62 vs GPT-4-turbo (on this submetric)BFCL-v3 Non-live (AST)Table 2 Non-live (A) columnTable 2

What To Try In 7 Days

Fine-tune your 8B LLM with a 25k-sample subset from ToolACE using LoRA to test function-calling gains.

Add a rule-based syntax checker for all generated function calls before model training.

Implement a model-based verifier to catch factual/parameter hallucinations in synthetic tool outputs.

Agent Features

Memory
API example buffer (iterative evolution)
Planning
self-guided complexity evaluation
Tool Use
function callingAPI synthesis
Frameworks
TSSSDGDLV
Is Agentic

Yes

Architectures
multi-agent generation
Collaboration
user-assistant-tool role-play

Optimization Features

Training Optimization
LoRAloss-based complexity sampling

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Complexity evaluation scales poorly with larger models and many samples; compute cost grows quickly.

ToolACE-trained models still lag top models (e.g., GPT-4) on broad reasoning benchmarks.

When Not To Use

You need large-scale general reasoning or SOTA multi-task performance across many non-tool benchmarks.

You require ground-truth live API execution during training rather than simulated tool responses.

Failure Modes

Verification misses can let hallucinated parameter values into training data, propagating errors.

Overfitting to synthetic API styles may reduce real-API generalization outside the sampled domains.

Core Entities

Models

ToolACE-8BLLaMA-3.1-8B-InstructQwen-1.5-xB-ChatxLAM-7b-fc-rGorilla-OpenFunctions-v2GPT-4 (for comparison)

Metrics

AccuracyRelevanceIrrelevance

Datasets

ToolACE synthetic dataset (full API pool)ToolACE subsets (easy/medium/hard; low/med/high diversity)

Benchmarks

BFCL (Berkeley Function-Calling Leaderboard)API-Bank