ToolACE: auto-generates 26k verified APIs and complex dialogs to teach LLMs reliable function calling

September 2, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

2

Authors

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen

Links

Abstract / PDF

Why It Matters For Business

ToolACE lets mid-size LLMs (8B) learn practical API use by supplying large, diverse, and verified synthetic tool data—reducing reliance on proprietary APIs and enabling in-house fine-tuned agents for automation tasks.

Summary TLDR

ToolACE is an automated pipeline that synthesizes diverse, complex, and verified function-calling data for LLMs. It builds a 26,507-API pool via a self-evolution synthesis step, generates multi-agent dialogs guided by a loss-based complexity evaluator, and filters data via a rule checker plus model-based checks. Fine-tuning an 8B model (ToolACE-8B) with this data yields state-of-the-art function-calling performance on BFCL and API-Bank benchmarks, matching many API-backed systems on evaluated tasks. A subset of the model and data is publicly released.

Problem Statement

Real function-calling requires diverse, accurate, and complex examples, but collecting labeled real API dialogs is costly and public pipelines produce shallow, narrow samples. LLMs need training data matched to their capability and validated for executability. ToolACE aims to auto-synthesize large-scale, capability-aware, and verified function-calling data to improve zero-shot and fine-tuned tool use.

Main Contribution

A three-stage automated pipeline (Tool Self-Evolution Synthesis, Self-Guided Dialog Generation, Dual-Layer Verification) to create large-scale function-calling data.

A loss-based complexity evaluator that uses the target LLM to steer dialog difficulty so training samples fit model capacity.

A dual-layer verification system (rule checker + model checkers) to ensure syntactic executability and content correctness of generated function calls.
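
The loss-based complexity idea can be illustrated with a minimal sketch. It assumes complexity is scored as the average negative log-likelihood that the target LLM assigns to the reference response tokens; the exact formula is not given here, and the function name and the hard-coded log-probabilities are hypothetical stand-ins for real model output.

```python
from typing import Sequence


def sample_complexity(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood of the reference tokens (higher = harder)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical log-probabilities the target LLM assigned to each reference token.
easy_sample = [-0.1, -0.2, -0.1, -0.2]
hard_sample = [-1.5, -2.0, -0.5, -2.0]

easy_score = sample_complexity(easy_sample)
hard_score = sample_complexity(hard_sample)
print(f"easy={easy_score:.2f} hard={hard_score:.2f}")
```

A sample the model already predicts well gets a low score, while one it finds surprising gets a high score, which is the signal a generator could use to raise or lower dialog difficulty.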

Key Findings

ToolACE builds a very large synthetic API pool.

Numbers: 26,507 APIs across 390 domains

An 8B model fine-tuned on ToolACE ranks third on the BFCL leaderboard for function calling, just behind GPT-4-turbo.

Numbers: BFCL overall ≈59.22; Non-live AST 89.27; Non-live Exec 90.07

ToolACE matches or beats open-source baselines on API-Bank.

Numbers: API-Bank Call accuracy 75.94; Retrieval+Call 47.41

Dual-layer verification meaningfully improves final model performance.

Numbers: ToolACE (final) overall 58.19 vs 24.9 for other datasets without verification; the ablation shows a drop when the model checker is removed (Fig. 3)

Balanced complexity is best for learning.

Numbers: Medium-complexity subset outperforms easy/hard subsets on BFCL (Figure 4)
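
One way to build easy/medium/hard subsets like those compared above is to sort samples by a complexity score and cut the ordering into bands. The sketch below assumes three near-equal bands; the band boundaries, the function name, and the sample IDs and scores are all hypothetical, and the actual split used for Figure 4 is not stated here.

```python
from typing import Dict, List, Tuple


def split_by_complexity(scored: List[Tuple[str, float]]) -> Dict[str, List[str]]:
    """Sort samples by score and cut them into three near-equal bands."""
    ordered = sorted(scored, key=lambda item: item[1])
    n = len(ordered)
    cut1, cut2 = n // 3, (2 * n) // 3
    return {
        "easy": [sid for sid, _ in ordered[:cut1]],
        "medium": [sid for sid, _ in ordered[cut1:cut2]],
        "hard": [sid for sid, _ in ordered[cut2:]],
    }


# Hypothetical (sample_id, complexity_score) pairs.
scored_samples = [
    ("d1", 0.2), ("d2", 1.9), ("d3", 0.9),
    ("d4", 1.1), ("d5", 0.4), ("d6", 2.5),
]
bands = split_by_complexity(scored_samples)
print(bands["medium"])
```

Training on `bands["medium"]` alone versus the other two bands would reproduce the shape of the comparison, though not necessarily the reported numbers.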

Data diversity matters; removing multi-type or parallel samples hurts behavior detection.

Numbers: Removing multi-type samples drops irrelevance detection to 6.99% (Table 7)

Results

BFCL overall (ToolACE-8B)

Value: 59.22 (rank 3 on leaderboard)

Baseline: GPT-4-turbo (59.49)

BFCL Non-live AST (ToolACE-8B)

Value: 89.27

Baseline: GPT-4-turbo 82.65

API-Bank Call accuracy (ToolACE-8B)

Value: 75.94

Baseline: gpt-4-0613 75.94 (tie)

Training-data comparison (overall after 25k samples)

Value: ToolACE 58.19 vs xLAM 40.51 vs ToolLLM 24.9

Baseline: xLAM and ToolLLM (same sample budgets)

Ablation: remove multi-type

Value: Overall drops to 42.71; Irrelevance 6.99%

Baseline: Full ToolACE overall 58.19; Irrelevance 86.42%

Who Should Care

What To Try In 7 Days

Fine-tune your 8B LLM with a 25k-sample subset from ToolACE using LoRA to test function-calling gains.

Add a rule-based syntax checker for all generated function calls before model training.

Implement a model-based verifier to catch factual/parameter hallucinations in synthetic tool outputs.
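
As a starting point for the rule-based syntax check, the sketch below validates a generated call against an API definition: valid JSON, known function name, required parameters present, and argument types matching. The definition layout, the `get_weather` API, and the `check_call` helper are assumptions made for illustration, not ToolACE's actual rule set.

```python
import json
from typing import Any, Dict, List

# Hypothetical API pool; the field layout is an assumption for this sketch.
APIS: Dict[str, Dict[str, Any]] = {
    "get_weather": {
        "parameters": {
            "city": {"type": "string", "required": True},
            "days": {"type": "integer", "required": False},
        }
    }
}

PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_call(raw_call: str, apis: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of rule violations; an empty list means the call passed."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return ["call is not valid JSON"]
    if not isinstance(call, dict):
        return ["call must be a JSON object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in apis:
        return [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = apis[name]["parameters"]
    errors = [
        f"missing required parameter: {pname}"
        for pname, spec in params.items()
        if spec.get("required") and pname not in args
    ]
    for pname, value in args.items():
        spec = params.get(pname)
        if spec is None:
            errors.append(f"unexpected parameter: {pname}")
            continue
        expected = PY_TYPES.get(spec["type"], object)
        # bool is a subclass of int, so reject it unless a boolean is expected.
        wrong_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if wrong_bool or not isinstance(value, expected):
            errors.append(f"parameter {pname} should be {spec['type']}")
    return errors


good = check_call('{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}', APIS)
bad = check_call('{"name": "get_weather", "arguments": {"days": "3"}}', APIS)
print(good, bad)
```

Returning a list of violations rather than a boolean makes it easy to log why a synthetic sample was dropped before training.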

Agent Features

Memory

  • API example buffer (iterative evolution)

Planning

  • self-guided complexity evaluation

Tool Use

  • function calling
  • API synthesis

Frameworks

  • TSS
  • SDG
  • DLV

Is Agentic

true

Architectures

  • multi-agent generation

Collaboration

  • user-assistant-tool role-play

Optimization Features

Training Optimization

  • LoRA
  • loss-based complexity sampling

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Complexity evaluation scales poorly with larger models and many samples; compute cost grows quickly.
  • ToolACE-trained models still lag top models (e.g., GPT-4) on broad reasoning benchmarks.
  • Non-uniform sampling can bias training and leave hard examples underrepresented.
  • Only a subset of the data and models is publicly released.

When Not To Use

  • You need large-scale general reasoning or SOTA multi-task performance across many non-tool benchmarks.
  • You require ground-truth live API execution during training rather than simulated tool responses.
  • You need fully open-source datasets and code beyond the partially released subset.

Failure Modes

  • Verification misses can let hallucinated parameter values into training data, propagating errors.
  • Overfitting to synthetic API styles may reduce real-API generalization outside the sampled domains.
  • Evaluator bias (using the learner as evaluator) can exclude useful hard samples.
  • Multi-agent rollouts can converge on repetitive or consensus behaviors if diversity controls are weak.
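
To make the hallucinated-parameter failure mode concrete, here is a deliberately crude heuristic that flags argument values that never appear in the dialog context. It is a cheap pre-filter written for illustration, not the model-based checker described in this summary; the function name and example dialog are hypothetical, and plain substring matching will miss paraphrased values and can accept coincidental matches.

```python
from typing import Any, Dict, List


def ungrounded_arguments(arguments: Dict[str, Any], context: str) -> List[str]:
    """Return names of arguments whose values never appear in the context text."""
    haystack = context.lower()
    return [
        name
        for name, value in arguments.items()
        if str(value).lower() not in haystack
    ]


# Hypothetical dialog context and generated call arguments.
context = "User: Book a table for 4 people in Lyon tomorrow."
args = {"city": "Lyon", "party_size": 4, "phone": "555-0100"}

flagged = ungrounded_arguments(args, context)
print(flagged)
```

Samples with flagged arguments could be routed to a stronger model-based check or a human reviewer instead of being discarded outright.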

Core Entities

Models

  • ToolACE-8B
  • LLaMA-3.1-8B-Instruct
  • Qwen-1.5-xB-Chat
  • xLAM-7b-fc-r
  • Gorilla-OpenFunctions-v2
  • GPT-4 (for comparison)

Metrics

  • Accuracy
  • Relevance
  • Irrelevance

Datasets

  • ToolACE synthetic dataset (full API pool)
  • ToolACE subsets (easy/medium/hard; low/med/high diversity)

Benchmarks

  • BFCL (Berkeley Function-Calling Leaderboard)
  • API-Bank