Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
ToolACE lets mid-size LLMs (8B) learn practical API use by supplying large, diverse, and verified synthetic tool data. This reduces reliance on proprietary APIs and enables in-house fine-tuned agents for automation tasks.
Summary TLDR
ToolACE is an automated pipeline that synthesizes diverse, complex, and verified function-calling data for LLMs. It builds a 26,507-API pool via a self-evolution synthesis step, generates multi-agent dialogs guided by a loss-based complexity evaluator, and filters data via a rule checker plus model-based checks. Fine-tuning an 8B model (ToolACE-8B) with this data yields state-of-the-art function-calling performance on BFCL and API-Bank benchmarks, matching many API-backed systems on evaluated tasks. A subset of the model and data is publicly released.
Problem Statement
Real function-calling requires diverse, accurate, and complex examples, but collecting labeled real API dialogs is costly and public pipelines produce shallow, narrow samples. LLMs need training data matched to their capability and validated for executability. ToolACE aims to auto-synthesize large-scale, capability-aware, and verified function-calling data to improve zero-shot and fine-tuned tool use.
Main Contribution
A three-stage automated pipeline (Tool Self-Evolution Synthesis, Self-Guided Dialog Generation, Dual-Layer Verification) to create large-scale function-calling data.
A loss-based complexity evaluator that uses the target LLM to steer dialog difficulty so training samples fit model capacity.
A dual-layer verification system (rule checker + model checkers) to ensure syntactic executability and content correctness of generated function calls.
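The loss-based complexity idea can be illustrated with a minimal sketch. The source states only that the evaluator uses the target LLM's loss to steer difficulty. The per-token log-probabilities, the field names (`id`, `logprobs`), and the band thresholds below are hypothetical, and in practice the log-probabilities would come from scoring each reference response with the model being trained.

```python
from statistics import mean

def sample_loss(token_logprobs):
    """Average negative log-likelihood per token; higher means harder for the model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -mean(token_logprobs)

def select_band(samples, low, high):
    """Keep samples whose loss lies in [low, high] (hypothetical thresholds)."""
    return [s for s in samples if low <= sample_loss(s["logprobs"]) <= high]

# Toy data: log-probs are made up for illustration, not taken from the paper.
samples = [
    {"id": "easy", "logprobs": [-0.05, -0.10, -0.05]},
    {"id": "medium", "logprobs": [-0.9, -1.1, -1.0]},
    {"id": "hard", "logprobs": [-3.0, -4.0, -5.0]},
]

kept = select_band(samples, low=0.5, high=2.0)
print([s["id"] for s in kept])
```

Filtering on a loss band is only one way to use such a score. The same value could instead be fed back to the dialog generator to ask for simpler or harder conversations.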
Key Findings
ToolACE builds a synthetic pool of 26,507 APIs through its self-evolution synthesis step.
An 8B model fine-tuned on ToolACE data (ToolACE-8B) reaches state-of-the-art function-calling performance on BFCL.
ToolACE matches or beats open-source baselines on API-Bank.
Dual-layer verification meaningfully improves final model performance.
Balanced complexity is best for learning.
Data diversity matters; removing multi-type or parallel samples hurts behavior detection.
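Since the mix of sample types affects results, it can help to audit that mix before training. The sketch below assumes each sample carries a `calls` list for the assistant turn. The field name and the three category labels are hypothetical and do not reproduce the paper's own taxonomy.

```python
from collections import Counter

def call_type(sample):
    """Label a sample by how many function calls the assistant turn contains."""
    n = len(sample["calls"])
    if n == 0:
        return "no_call"   # the assistant should decline to call any tool
    if n == 1:
        return "single"
    return "parallel"      # several calls issued in the same turn

def coverage(dataset):
    """Count samples per call type."""
    return Counter(call_type(s) for s in dataset)

# Toy dataset with hypothetical API names.
dataset = [
    {"calls": []},
    {"calls": [{"name": "get_weather"}]},
    {"calls": [{"name": "get_weather"}, {"name": "get_time"}]},
    {"calls": [{"name": "get_time"}]},
]

counts = coverage(dataset)
print(dict(counts))
```

A category with a count of zero or close to zero is a signal that the training subset may not teach that behavior.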
Results
BFCL overall (ToolACE-8B)
BFCL Non-live AST (ToolACE-8B)
Accuracy
Training-data comparison (overall after 25k samples)
Ablation: remove multi-type
Who Should Care
What To Try In 7 Days
Fine-tune your 8B LLM with a 25k-sample subset from ToolACE using LoRA to test function-calling gains.
Add a rule-based syntax checker for all generated function calls before model training.
Implement a model-based verifier to catch factual/parameter hallucinations in synthetic tool outputs.
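A rule-based syntax check can be sketched in a few lines. This is an illustrative checker, not ToolACE's actual rule set: the call format (`name` plus `arguments`), the API spec layout, and the `get_weather` API are all assumptions made for the example.

```python
import json

# Map spec type names to Python types (hypothetical, minimal type system).
PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def check_call(raw_call, api_pool):
    """Return a list of rule violations for one generated call (empty list = pass)."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["call must be a JSON object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in api_pool:
        return [f"unknown API: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    spec = api_pool[name]
    errors = [f"missing required parameter: {p}"
              for p in spec["required"] if p not in args]
    for param, value in args.items():
        expected = spec["parameters"].get(param)
        if expected is None:
            errors.append(f"unexpected parameter: {param}")
            continue
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_stray_bool = isinstance(value, bool) and expected != "boolean"
        if is_stray_bool or not isinstance(value, PY_TYPES[expected]):
            errors.append(f"wrong type for {param}: expected {expected}")
    return errors

# Hypothetical API definition used only for this example.
api_pool = {
    "get_weather": {
        "parameters": {"city": "string", "days": "integer"},
        "required": ["city"],
    }
}

good = check_call('{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}', api_pool)
bad = check_call('{"name": "get_weather", "arguments": {"days": "3"}}', api_pool)
print(good, bad)
```

A check like this only covers structure. Whether a parameter value is actually supported by the conversation is the kind of content error that a separate model-based verifier would need to catch.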
Agent Features
Memory
- API example buffer (iterative evolution)
Planning
- self-guided complexity evaluation
Tool Use
- function calling
- API synthesis
Frameworks
- TSS
- SDG
- DLV
Is Agentic
true
Architectures
- multi-agent generation
Collaboration
- user-assistant-tool role-play
Optimization Features
Training Optimization
- LoRA
- loss-based complexity sampling
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Complexity evaluation scales poorly with larger models and many samples; compute cost grows quickly.
- ToolACE-trained models still lag top models (e.g., GPT-4) on broad reasoning benchmarks.
- Non-uniform sampling can bias training and leave hard examples underrepresented.
- Only a subset of the data and models is publicly released.
When Not To Use
- You need large-scale general reasoning or SOTA multi-task performance across many non-tool benchmarks.
- You require ground-truth live API execution during training rather than simulated tool responses.
- You need fully open-source datasets and code beyond the partially released subset.
Failure Modes
- Verification misses can let hallucinated parameter values into training data, propagating errors.
- Overfitting to synthetic API styles may reduce real-API generalization outside the sampled domains.
- Evaluator bias (using the learner as evaluator) can exclude useful hard samples.
- Multi-agent rollouts can converge on repetitive or consensus behaviors if diversity controls are weak.
Core Entities
Models
- ToolACE-8B
- LLaMA-3.1-8B-Instruct
- Qwen-1.5-xB-Chat
- xLAM-7b-fc-r
- Gorilla-OpenFunctions-v2
- GPT-4 (for comparison)
Metrics
- Accuracy
- Relevance
- Irrelevance
Datasets
- ToolACE synthetic dataset (full API pool)
- ToolACE subsets (easy/medium/hard; low/med/high diversity)
Benchmarks
- BFCL (Berkeley Function-Calling Leaderboard)
- API-Bank

