Tool-R0: teach LLMs to call real tools from scratch using Generator–Solver self-play

February 24, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

Tool-R0 shows practical promise: it reduces labeling needs and produces measurable gains on five benchmarks, but it needs careful reward tuning, compute for Monte Carlo probing, and more runs to validate statistical robustness.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 80%

Authors

Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur

Why It Matters For Business

Tool-R0 cuts reliance on costly human-labeled tool-call datasets by letting models self-generate curricula and learn API usage, enabling faster integration of new tools and domains with lower annotation budgets.

Summary TLDR

Tool-R0 is a zero-data self-play RL framework that turns a single base LLM into a tool-calling agent. The model is split into a Generator (makes verifiable tool tasks) and a Solver (learns to call tools). A difficulty-aware reward and verification-based dataset pipeline create an easy→hard curriculum. On five function-calling benchmarks, Tool-R0 raises a 1.5B model from 24.85% → 47.84% average accuracy (+22.99 pts, +92.52% relative) and outperforms several supervised baselines, all without human-labeled training data.

Problem Statement

Human-curated tool-calling datasets are costly and static. Can a weak LLM self-improve into a reliable tool-calling agent without any human task data? Tool-R0 studies whether self-play between a Generator and a Solver can autonomously create a verifiable curriculum and teach tool use from scratch.

Main Contribution

Algorithm: Tool-R0, a dual-agent self-play RL loop where a Generator and a Solver co-evolve under complementary rewards to synthesize and learn verifiable tool tasks without external data.

Performance: Demonstrates large gains across five diverse function-calling benchmarks and beats several supervised baselines while using zero curated data.
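The dual-agent loop described under Algorithm can be outlined as a short skeleton. This is a minimal sketch rather than the authors' implementation: the function names (`self_play`, `propose`, `is_verifiable`, `update_generator`, `update_solver`) are hypothetical, the ordering of the two updates within an iteration is an assumption, and the toy stand-ins at the bottom replace any real LLM or RL step.

```python
import copy
from typing import Any, Callable, List, Tuple


def self_play(
    base_policy: Any,
    propose: Callable[[Any, int], List[Any]],
    is_verifiable: Callable[[Any], bool],
    update_generator: Callable[[Any, Any, List[Any]], Any],
    update_solver: Callable[[Any, List[Any]], Any],
    iterations: int = 3,
    n_candidates: int = 16,
) -> Tuple[Any, Any]:
    """Alternate task generation and Solver training for a few rounds."""
    # Both roles start from the same base model but keep separate parameters.
    generator = copy.deepcopy(base_policy)
    solver = copy.deepcopy(base_policy)
    for _ in range(iterations):
        candidates = propose(generator, n_candidates)
        # Only tasks that pass verification are used for training.
        tasks = [task for task in candidates if is_verifiable(task)]
        # Assumption: the Generator update sees the current Solver,
        # so its reward can depend on how hard the tasks are for that Solver.
        generator = update_generator(generator, solver, tasks)
        solver = update_solver(solver, tasks)
    return generator, solver


# Toy stand-ins so the skeleton runs end to end (no LLM, no RL update).
def toy_propose(generator, n):
    return list(range(n))


def toy_is_verifiable(task):
    return task % 2 == 0  # pretend odd-numbered tasks fail verification


def toy_update_generator(generator, solver, tasks):
    generator["updates"] += 1
    return generator


def toy_update_solver(solver, tasks):
    solver["updates"] += 1
    solver["tasks_seen"] += len(tasks)
    return solver


base = {"updates": 0, "tasks_seen": 0}
gen, sol = self_play(
    base, toy_propose, toy_is_verifiable,
    toy_update_generator, toy_update_solver,
    iterations=3, n_candidates=4,
)
print(gen, sol)
```

In a real setup the two update callables would wrap an RL step such as GRPO, and `propose` would sample tool tasks from the Generator; the skeleton only fixes the control flow and the fact that the two roles do not share a parameter object.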

Key Findings

Self-play yields large real gains from zero data on tool-calling.

Numbers: Avg +22.99 pts (24.85 → 47.84); +92.52% rel

Practical Use: You can bootstrap a 1.5B tool agent without labels by running self-play that alternates task generation and solver training.

Evidence Ref: Table 1

Zero-data Tool-R0 matches or beats supervised baselines on average.

Numbers: 47.84 (Tool-R0) vs 46.06 (best supervised ToolRL) avg acc

Practical Use: Prioritize adaptive self-generated curricula when labeling budgets are limited; on average they can outperform supervised training on tens of thousands of human-labeled examples.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 47.84% | 24.85% (Qwen2.5-1.5B base) | +22.99 pp | Tool-Alpaca, Seal-Tools, NexusRaven, API-Bank, SNIPS (avg) | Tool-R0 average after self-play | Table 1
Accuracy | 47.84% (Tool-R0) | 46.06% (best supervised baseline ToolRL re-trained on same backbone) | +1.78 pp | Avg over five benchmarks | Tool-R0 outperforms several supervised agents trained on 4k–210k samples | Table 2
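The deltas in the results can be checked with a few lines of arithmetic. The helper name `gains` is hypothetical; the inputs are the accuracies reported above, and the relative gain over ToolRL is derived here rather than quoted from the paper.

```python
from typing import Tuple


def gains(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


print(gains(24.85, 47.84))  # (22.99, 92.52): base model vs Tool-R0
print(gains(46.06, 47.84))  # (1.78, 3.86): ToolRL vs Tool-R0
```

The distinction matters when reading the headline numbers: the +22.99 figure is a difference in percentage points, while +92.52% is that difference divided by the 24.85% baseline.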

What To Try In 7 Days

Run a small-scale self-play loop: initialize the Generator and Solver from the same base LLM, give each role its own copy of the parameters, and run 2–3 iterations on a handful of target APIs.

Implement the verification pipeline: require JSON tool menus and gold tool calls to enable execution-based feedback and filter bad samples.

Adopt a band-pass difficulty signal: estimate solver success via K Monte Carlo rollouts and target tasks with mid-range pass probabilities (e.g., 0.25–0.75).
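The band-pass difficulty signal in the last step can be sketched as follows. This is an illustration under stated assumptions, not the paper's reward: `estimate_pass_rate`, `in_band`, and `make_simulated_solver` are hypothetical names, the band is treated as a hard 0/1 filter, and a random coin flip stands in for a real Solver rollout checked against the gold tool call.

```python
import random
from typing import Callable


def estimate_pass_rate(attempt: Callable[[], bool], k: int = 8) -> float:
    """Monte Carlo estimate: fraction of k independent rollouts that succeed."""
    if k <= 0:
        raise ValueError("k must be positive")
    successes = sum(1 for _ in range(k) if attempt())
    return successes / k


def in_band(pass_rate: float, low: float = 0.25, high: float = 0.75) -> bool:
    """True when a task is neither too easy nor too hard for the Solver."""
    return low <= pass_rate <= high


def make_simulated_solver(true_pass_prob: float, rng: random.Random) -> Callable[[], bool]:
    """Stand-in for one Solver rollout on a task (assumption: Bernoulli outcome)."""
    return lambda: rng.random() < true_pass_prob


rng = random.Random(0)
for true_prob in (0.02, 0.5, 0.98):
    p_hat = estimate_pass_rate(make_simulated_solver(true_prob, rng), k=8)
    print(f"true={true_prob:.2f} estimated={p_hat:.3f} keep={in_band(p_hat)}")
```

With K=8 the estimate can only take the values 0, 1/8, ..., 1, so a 0.25–0.75 band keeps tasks solved in 2 to 6 of the 8 rollouts. A smoother reward that peaks near 0.5 would be a natural alternative to the hard cutoff used here.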

Agent Features

Memory
short-term self-play curriculum (no long-term retrieval)
Planning
tool planning, multi-step tool composition
Tool Use
function calling, schema grounding, multi-call sequencing
Frameworks
GRPO, curriculum construction via pass@K
Is Agentic

Yes

Architectures
single-model dual-role (Generator & Solver)
Collaboration
two-role co-evolution (Generator vs Solver)

Optimization Features

Token Efficiency
uses Monte Carlo probes (K=8) for difficulty estimation; this adds compute, but K is small
Infra Optimization
three GPUs with gradient accumulation; pragmatic small-batch self-play setup
Model Optimization
parameter separation for role stability
System Optimization
mixed precision (bfloat16) and DeepSpeed ZeRO-3 used for scaling
Training Optimization
GRPO, difficulty-conditioned curriculum, cross-verification and deduplication
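The verification and deduplication steps named above, together with the JSON tool menu and gold tool call requirement from the 7-day plan, could look roughly like the filter below. Everything about the sample layout is an assumption made for illustration: the keys `tools`, `gold_call`, `parameters`, `required`, and `question` are hypothetical, the check is a plain schema match rather than the paper's cross-verification, and deduplication is by exact match only.

```python
import json
from typing import Any, Dict, List


def is_verifiable(sample: Dict[str, Any]) -> bool:
    """Gold call must name a tool in the menu and use only declared parameters."""
    try:
        menu = {tool["name"]: tool for tool in sample["tools"]}
        call = sample["gold_call"]
        spec = menu[call["name"]]
        args = call["arguments"]
        params = spec["parameters"]
        required = set(spec.get("required", []))
    except (KeyError, TypeError):
        return False  # malformed sample or unknown tool name
    if not isinstance(args, dict) or not isinstance(params, dict):
        return False
    # All required parameters present, and no undeclared parameters.
    return required <= set(args) <= set(params)


def filter_samples(raw_samples: List[str]) -> List[Dict[str, Any]]:
    """Parse generated samples, drop invalid ones, and remove exact duplicates."""
    kept: List[Dict[str, Any]] = []
    seen = set()
    for text in raw_samples:
        try:
            sample = json.loads(text)
        except json.JSONDecodeError:
            continue  # Generator emitted something that is not JSON
        if not isinstance(sample, dict) or not is_verifiable(sample):
            continue
        # Canonical key: question (if any) plus the gold call, with sorted keys.
        key = json.dumps(
            {"question": sample.get("question"), "call": sample["gold_call"]},
            sort_keys=True,
        )
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept
```

A filter like this only guarantees that a sample is well-formed and internally consistent. It does not check that the gold call actually answers the question, which is where the reward hacking risk listed under Limitations comes in.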

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Code URLs

project page referenced in paper (no direct URL provided in text)

Risks & Boundaries

Limitations

Computational overhead: difficulty estimation queries the Solver multiple times per candidate (K=8), increasing cost.

Reward hacking risk: small models can pass verifiable checks while producing low-quality supervision.

When Not To Use

If strict regulatory audit or human-verifiable provenance is required for every training example.

When compute budget cannot support repeated Monte Carlo difficulty probes and self-play iterations.
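A quick back-of-the-envelope estimate helps decide whether the probing budget is affordable. The sketch below assumes every candidate task is probed K times in every iteration, which may overstate or understate what the paper does; `probe_budget` and the workload numbers are hypothetical.

```python
def probe_budget(candidates_per_iteration: int, iterations: int, k: int = 8) -> int:
    """Total Solver rollouts spent on difficulty estimation alone."""
    return candidates_per_iteration * iterations * k


# Hypothetical workload: 1,000 candidate tasks per iteration, 3 iterations.
print(probe_budget(1_000, 3))       # 24000 rollouts with K=8
print(probe_budget(1_000, 3, k=1))  # 3000 rollouts with a single probe per task
```

These rollouts come on top of the rollouts used for the RL updates themselves, so the probing cost scales linearly with K and with the number of candidates the Generator proposes.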

Failure Modes

Structural errors: wrong tool name, wrong number of calls, missing/extra arguments.

Semantic errors: incorrect argument values or missing required keys despite syntactic correctness.
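The two failure categories above can be told apart by comparing a predicted call list against the gold calls. This is a minimal sketch under assumptions: calls are represented as dictionaries with hypothetical `name` and `arguments` keys, calls are compared position by position, and only the gold call is consulted (a schema-based check for required keys would need the tool menu as well).

```python
from typing import Any, Dict, List

Call = Dict[str, Any]  # assumed shape: {"name": str, "arguments": {...}}


def classify_errors(predicted: List[Call], gold: List[Call]) -> List[str]:
    """Label mismatches between predicted and gold calls as structural or semantic."""
    errors: List[str] = []
    if len(predicted) != len(gold):
        errors.append("structural: wrong number of calls")
    for pred, ref in zip(predicted, gold):
        if pred.get("name") != ref["name"]:
            errors.append(f"structural: wrong tool name {pred.get('name')!r}")
            continue  # arguments of a different tool are not comparable
        pred_args = pred.get("arguments", {})
        gold_args = ref["arguments"]
        missing = sorted(set(gold_args) - set(pred_args))
        extra = sorted(set(pred_args) - set(gold_args))
        if missing:
            errors.append(f"structural: missing arguments {missing}")
        if extra:
            errors.append(f"structural: extra arguments {extra}")
        # Keys match but values differ: the call is well-formed yet wrong.
        wrong = sorted(
            key for key in set(gold_args) & set(pred_args)
            if pred_args[key] != gold_args[key]
        )
        if wrong:
            errors.append(f"semantic: incorrect values for {wrong}")
    return errors


# Hypothetical tool and arguments, for illustration only.
gold_calls = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}]
print(classify_errors(
    [{"name": "get_weather", "arguments": {"city": "London", "unit": "C"}}],
    gold_calls,
))
```

An empty list means the prediction matches the gold calls exactly. Exact value comparison is strict; free-text arguments may need a looser match in practice.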

Core Entities

Models

Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct

Metrics

Accuracy, pass@K (for difficulty estimation)

Datasets

none (zero-data self-play for training)

Benchmarks

Tool-Alpaca, Seal-Tools, NexusRaven, API-Bank, SNIPS