ToolACE: auto-generates 26k verified APIs and complex dialogs to teach LLMs reliable function calling

Overview

Decision Snapshot: Needs Validation

The paper provides benchmark wins, ablations, and clear modules; evidence is strong for function-calling tasks on BFCL and API-Bank but limited to provided benchmarks and model sizes.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen

Why It Matters For Business

ToolACE lets mid-size LLMs (8B) learn practical API use by supplying large, diverse, and verified synthetic tool data. This reduces reliance on proprietary APIs and enables in-house fine-tuned agents for automation tasks.

Who Should Care

ML Engineer, Product Manager, CTO, Engineering Lead, Founder, CEO

Summary TLDR

ToolACE is an automated pipeline that synthesizes diverse, complex, and verified function-calling data for LLMs. It builds a 26,507-API pool via a self-evolution synthesis step, generates multi-agent dialogs guided by a loss-based complexity evaluator, and filters data via a rule checker plus model-based checks. Fine-tuning an 8B model (ToolACE-8B) with this data yields state-of-the-art function-calling performance on BFCL and API-Bank benchmarks, matching many API-backed systems on evaluated tasks. A subset of the model and data is publicly released.

Problem Statement

Real function-calling requires diverse, accurate, and complex examples, but collecting labeled real API dialogs is costly and public pipelines produce shallow, narrow samples. LLMs need training data matched to their capability and validated for executability. ToolACE aims to auto-synthesize large-scale, capability-aware, and verified function-calling data to improve zero-shot and fine-tuned tool use.

Main Contribution

A three-stage automated pipeline (Tool Self-Evolution Synthesis, Self-Guided Dialog Generation, Dual-Layer Verification) to create large-scale function-calling data.

A loss-based complexity evaluator that uses the target LLM to steer dialog difficulty so training samples fit model capacity.
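The loss-based evaluator can be sketched in a few lines. This is a minimal illustration, assuming that complexity is measured as the target model's mean per-token negative log-likelihood on the assistant response and that a sample is kept when its score falls inside a chosen band. The function names, the band thresholds, and the hard-coded log-probabilities are hypothetical stand-ins, not the paper's implementation.

```python
from typing import Dict, List


def complexity_score(token_logprobs: List[float]) -> float:
    """Mean negative log-likelihood of the response tokens (higher = harder)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def select_in_band(samples: Dict[str, List[float]], low: float, high: float) -> List[str]:
    """Keep the ids of samples whose complexity lies in [low, high]."""
    return [sid for sid, lps in samples.items() if low <= complexity_score(lps) <= high]


# Hypothetical log-probabilities the target LLM assigned to each response token.
samples = {
    "easy": [-0.05, -0.10, -0.03],
    "medium": [-0.9, -1.1, -1.0],
    "hard": [-3.5, -4.0, -4.5],
}

# Drop samples that are trivially easy or far too hard for the current model.
selected = select_in_band(samples, low=0.5, high=2.0)
print(selected)
```

In practice the log-probabilities would come from a forward pass of the model being fine-tuned, so the band would need re-tuning for each target model.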

Key Findings

ToolACE builds a very large synthetic API pool.

Numbers: 26,507 APIs across 390 domains

Practical Use: Train models on far broader API coverage to improve zero-shot and cross-domain function-calling.

Evidence Ref: Table 1; Abstract

An 8B model fine-tuned on ToolACE ranks third overall on the BFCL leaderboard for function calling.

Numbers: BFCL overall 59.22; Non-live AST 89.27; Non-live Exec 90.07

Practical Use: You can reach near state-of-the-art function-calling with an 8B model plus targeted synthetic data, rather than relying only on larger proprietary models.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BFCL overall (ToolACE-8B)	59.22 (rank 3 on leaderboard)	GPT-4-turbo (59.49)	≈-0.27 vs top	BFCL-v3	Table 2 shows ToolACE-8B at 59.22 overall and rank 3	Table 2
BFCL Non-live AST (ToolACE-8B)	89.27	GPT-4-turbo 82.65	+6.62 vs GPT-4-turbo (on this submetric)	BFCL-v3 Non-live (AST)	Table 2 Non-live (A) column	Table 2
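The deltas in the table can be re-derived directly from the reported scores. The short check below uses only the numbers listed above; the variable names are illustrative.

```python
# Scores copied from the results table above.
toolace_overall, gpt4_turbo_overall = 59.22, 59.49
toolace_ast, gpt4_turbo_ast = 89.27, 82.65

# Delta = ToolACE-8B score minus the GPT-4-turbo baseline, rounded to two decimals.
delta_overall = round(toolace_overall - gpt4_turbo_overall, 2)
delta_ast = round(toolace_ast - gpt4_turbo_ast, 2)

print(delta_overall, delta_ast)  # -0.27 6.62
```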

What To Try In 7 Days

Fine-tune your 8B LLM with a 25k-sample subset from ToolACE using LoRA to test function-calling gains.

Add a rule-based syntax checker for all generated function calls before model training.

Implement a model-based verifier to catch factual/parameter hallucinations in synthetic tool outputs.
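A rule-based checker of the kind suggested above can start very small. The sketch below assumes a simplified API definition with a function name plus typed, optionally required parameters; the `WEATHER_API` definition, the `check_call` helper, and the error strings are hypothetical and are not taken from the ToolACE release.

```python
from typing import Any, Dict, List

# Hypothetical API definition in a simplified, schema-like shape.
WEATHER_API: Dict[str, Any] = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": str, "required": True},
        "days": {"type": int, "required": False},
    },
}


def check_call(call: Dict[str, Any], api: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations; an empty list means the call passes."""
    errors: List[str] = []
    if call.get("name") != api["name"]:
        return [f"unknown function: {call.get('name')!r}"]
    args = call.get("arguments", {})
    params = api["parameters"]
    # Rule 1: every required parameter must be present.
    for pname, spec in params.items():
        if spec["required"] and pname not in args:
            errors.append(f"missing required parameter: {pname}")
    # Rule 2: no unknown parameters, and each value must have the declared type.
    for aname, value in args.items():
        if aname not in params:
            errors.append(f"unexpected parameter: {aname}")
        elif not isinstance(value, params[aname]["type"]):
            errors.append(f"wrong type for {aname}")
    return errors


good = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
bad = {"name": "get_weather", "arguments": {"days": "three", "unit": "C"}}

print(check_call(good, WEATHER_API))
print(check_call(bad, WEATHER_API))
```

Samples that fail these cheap checks can be dropped before the more expensive model-based verifier runs, so the second layer only has to look for semantic problems such as hallucinated parameter values.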

Agent Features

Memory

API example buffer (iterative evolution)

Planning

self-guided complexity evaluation

Tool Use

function calling, API synthesis

Frameworks

TSS (Tool Self-Evolution Synthesis), SDG (Self-Guided Dialog Generation), DLV (Dual-Layer Verification)

Is Agentic

Yes

Architectures

multi-agent generation

Collaboration

user-assistant-tool role-play

Optimization Features

Training Optimization

LoRA, loss-based complexity sampling

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/Team-ACE

Data URLs

https://huggingface.co/Team-ACE (subset release announced)

Risks & Boundaries

Limitations

Complexity evaluation scales poorly with larger models and many samples; compute cost grows quickly.

ToolACE-trained models still lag top models (e.g., GPT-4) on broad reasoning benchmarks.

When Not To Use

You need large-scale general reasoning or SOTA multi-task performance across many non-tool benchmarks.

You require ground-truth live API execution during training rather than simulated tool responses.

Failure Modes

Verification misses can let hallucinated parameter values into training data, propagating errors.

Overfitting to synthetic API styles may reduce real-API generalization outside the sampled domains.

Core Entities

Models

ToolACE-8B, LLaMA-3.1-8B-Instruct, Qwen-1.5-xB-Chat, xLAM-7b-fc-r, Gorilla-OpenFunctions-v2, GPT-4 (for comparison)

Metrics

Accuracy, Relevance, Irrelevance

Datasets

ToolACE synthetic dataset (full API pool), ToolACE subsets (easy/medium/hard; low/med/high diversity)

Benchmarks

BFCL (Berkeley Function-Calling Leaderboard), API-Bank
