Tool-R0: teach LLMs to call real tools from scratch using Generator–Solver self-play

Overview

Decision Snapshot: Needs Validation

Tool-R0 shows practical promise: it reduces labeling needs and produces measurable gains on five benchmarks, but it needs careful reward tuning, compute for Monte Carlo probing, and more runs to validate statistical robustness.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 80%

Authors

Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur


Why It Matters For Business

Tool-R0 cuts reliance on costly human-labeled tool-call datasets by letting models self-generate curricula and learn API usage, enabling faster integration of new tools and domains with lower annotation budgets.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

Tool-R0 is a zero-data self-play RL framework that turns a single base LLM into a tool-calling agent. The model is split into a Generator (makes verifiable tool tasks) and a Solver (learns to call tools). A difficulty-aware reward and verification-based dataset pipeline create an easy→hard curriculum. On five function-calling benchmarks, Tool-R0 raises a 1.5B model from 24.85% → 47.84% average accuracy (+22.99 pts, +92.52% relative) and outperforms several supervised baselines, all without human-labeled training data.

Problem Statement

Human-curated tool-calling datasets are costly and static. Can a weak LLM self-improve into a reliable tool-calling agent without any human task data? Tool-R0 studies whether self-play between a Generator and a Solver can autonomously create a verifiable curriculum and teach tool use from scratch.

Main Contribution

Algorithm: Tool-R0, a dual-agent self-play RL loop where a Generator and a Solver co-evolve under complementary rewards to synthesize and learn verifiable tool tasks without external data.

Performance: Demonstrates large gains across five diverse function-calling benchmarks and beats several supervised baselines while using zero curated data.

Key Findings

Self-play yields large real gains from zero data on tool-calling.

Numbers: Avg +22.99 pts (24.85% → 47.84%); +92.52% relative

Practical Use: You can bootstrap a 1.5B tool agent without labels by running self-play that alternates task generation and solver training.

Evidence Ref: Table 1

Zero-data Tool-R0 matches or beats supervised baselines on average.

Numbers: 47.84% (Tool-R0) vs 46.06% (best supervised baseline, ToolRL) average accuracy

Practical Use: Prioritize adaptive self-generated curricula when labeling budgets are limited; on average, Tool-R0 outperforms supervised agents trained on 4k–210k samples.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	47.84%	24.85% (Qwen2.5-1.5B base)	+22.99 pp	Tool-Alpaca, Seal-Tools, NexusRaven, API-Bank, SNIPS (avg)	Tool-R0 average after self-play	Table 1
Accuracy	47.84% (Tool-R0)	46.06% (best supervised baseline ToolRL re-trained on same backbone)	+1.78 pp	Avg over five benchmarks	Tool-R0 outperforms several supervised agents trained on 4k–210k samples	Table 2
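
The deltas in the table can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `gains` is hypothetical, and the inputs are the average accuracies quoted above.

```python
def gains(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


# Average accuracies quoted in the Results table (percent).
base_vs_tool_r0 = gains(24.85, 47.84)    # Qwen2.5-1.5B base vs Tool-R0
toolrl_vs_tool_r0 = gains(46.06, 47.84)  # best supervised baseline vs Tool-R0

print(base_vs_tool_r0)       # absolute and relative gain over the base model
print(toolrl_vs_tool_r0[0])  # absolute gain over ToolRL
```

The first pair reproduces the +22.99 pp and +92.52% figures, and the second reproduces the +1.78 pp margin over ToolRL.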

What To Try In 7 Days

Run a small-scale self-play loop: initialize Generator and Solver from the same base LLM, split parameters, and run 2–3 iterations on a handful of target APIs.

Implement the verification pipeline: require JSON tool menus and gold tool calls to enable execution-based feedback and filter bad samples.

Adopt a band-pass difficulty signal: estimate solver success via K Monte Carlo rollouts and target tasks with mid-range pass probabilities (e.g., 0.25–0.75).
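
The band-pass idea in the last step can be sketched as a simple filter. This is a minimal illustration, not the paper's reward function: the names `estimate_pass_rate` and `in_band` are hypothetical, the Solver rollouts are replaced by toy callables, and a hard keep/drop threshold stands in for whatever smooth difficulty reward the authors use.

```python
from itertools import cycle
from typing import Callable


def estimate_pass_rate(solve_once: Callable[[], bool], k: int = 8) -> float:
    """Fraction of k independent Solver rollouts that pass verification."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for _ in range(k) if solve_once()) / k


def in_band(pass_rate: float, low: float = 0.25, high: float = 0.75) -> bool:
    """Keep tasks that are neither trivial nor currently unsolvable."""
    return low <= pass_rate <= high


# Toy stand-ins for a real Solver rollout plus verifier (hypothetical).
alternating = cycle([True, False])

too_easy = estimate_pass_rate(lambda: True)                 # 8 of 8 rollouts pass
too_hard = estimate_pass_rate(lambda: False)                # 0 of 8 rollouts pass
mid_range = estimate_pass_rate(lambda: next(alternating))   # 4 of 8 rollouts pass

candidates = [("too_easy", too_easy), ("too_hard", too_hard), ("mid_range", mid_range)]
kept = [name for name, rate in candidates if in_band(rate)]
print(kept)
```

Only the mid-range task survives the filter. In a real loop, `solve_once` would sample one Solver completion and check it against the gold tool call, so each candidate costs K Solver generations.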

Agent Features

Memory

short-term self-play curriculum (no long-term retrieval)

Planning

tool planning, multi-step tool composition

Tool Use

function calling, schema grounding, multi-call sequencing

Frameworks

GRPO, curriculum construction via pass@K

Is Agentic

Yes

Architectures

single-model dual-role (Generator & Solver)

Collaboration

two-role co-evolution (Generator vs Solver)

Optimization Features

Token Efficiency

uses Monte Carlo probes (K=8) for difficulty estimation; this adds compute, but K is small

Infra Optimization

three GPUs with gradient accumulation; pragmatic small-batch self-play setup

Model Optimization

parameter separation for role stability

System Optimization

mixed precision (bfloat16) and DeepSpeed ZeRO-3 used for scaling

Training Optimization

GRPO, difficulty-conditioned curriculum, cross-verification and deduplication
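
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses in the same group rather than against a learned value function. The sketch below shows only that normalization. It is a generic illustration, not Tool-R0's training code: the function name is hypothetical, and using the population standard deviation with a small `eps` is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one task: two verified correct (1.0), two incorrect (0.0).
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# If every rollout gets the same reward, no sample is preferred over another.
flat = group_relative_advantages([0.5, 0.5, 0.5])

print(mixed)
print(flat)
```

A group with identical rewards yields zero advantages and therefore no learning signal, which is one reason mid-difficulty tasks, where rollouts disagree, are the most useful for training.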

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

project page referenced in paper (no direct URL provided in text)

Risks & Boundaries

Limitations

Computational overhead: difficulty estimation queries the Solver multiple times per candidate (K=8), increasing cost.

Reward hacking risk: small models can pass verifiable checks while producing low-quality supervision.

When Not To Use

If strict regulatory audit or human-verifiable provenance is required for every training example.

When compute budget cannot support repeated Monte Carlo difficulty probes and self-play iterations.

Failure Modes

Structural errors: wrong tool name, wrong number of calls, missing/extra arguments.

Semantic errors: incorrect argument values or missing required keys despite syntactic correctness.
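
The two failure categories above can be made concrete by comparing a predicted call list against a gold one. This is a rough sketch under stated assumptions: the `{"name", "arguments"}` call format, the function `classify_call_errors`, and the `get_weather` example are all hypothetical, every gold argument is treated as required, and missing or extra keys are filed under structural errors while wrong values are filed under semantic errors.

```python
def classify_call_errors(predicted: list[dict], gold: list[dict]) -> list[str]:
    """Label the mismatches between predicted and gold tool calls."""
    errors = []
    if len(predicted) != len(gold):
        errors.append("structural: wrong number of calls")
    # Compare calls position by position; surplus calls are covered by the count check.
    for pred, ref in zip(predicted, gold):
        if pred["name"] != ref["name"]:
            errors.append(f"structural: wrong tool name {pred['name']!r}")
            continue  # argument comparison is meaningless for a different tool
        pred_args, ref_args = pred["arguments"], ref["arguments"]
        for key in sorted(ref_args.keys() - pred_args.keys()):
            errors.append(f"structural: missing argument {key!r}")
        for key in sorted(pred_args.keys() - ref_args.keys()):
            errors.append(f"structural: extra argument {key!r}")
        for key in sorted(ref_args.keys() & pred_args.keys()):
            if pred_args[key] != ref_args[key]:
                errors.append(f"semantic: incorrect value for {key!r}")
    return errors


gold_calls = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}]
predicted_calls = [{"name": "get_weather", "arguments": {"city": "London"}}]

report = classify_call_errors(predicted_calls, gold_calls)
print(report)
```

An empty list means the prediction matches the gold call exactly, which is the condition an execution-free verifier of this kind would accept.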

Core Entities

Models

Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct

Metrics

Accuracy, pass@K (for difficulty estimation)

Datasets

none (zero-data self-play for training)

Benchmarks

Tool-Alpaca, Seal-Tools, NexusRaven, API-Bank, SNIPS

