Tool-R0: teach LLMs to call real tools from scratch using Generator–Solver self-play

February 24, 2026 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.8

Cost Impact Score

0.7

Citation Count

0

Authors

Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur

Why It Matters For Business

Tool-R0 cuts reliance on costly human-labeled tool-call datasets by letting models self-generate curricula and learn API usage, enabling faster integration of new tools and domains with lower annotation budgets.

Summary TLDR

Tool-R0 is a zero-data self-play RL framework that turns a single base LLM into a tool-calling agent. The base model initializes two roles with separate parameters: a Generator, which produces verifiable tool tasks, and a Solver, which learns to call tools. A difficulty-aware reward and a verification-based dataset pipeline create an easy-to-hard curriculum. On five function-calling benchmarks, Tool-R0 raises a 1.5B model's average accuracy from 24.85% to 47.84% (+22.99 pts, +92.52% relative) and outperforms several supervised baselines, all without human-labeled training data.

Problem Statement

Human-curated tool-calling datasets are costly and static. Can a weak LLM self-improve into a reliable tool-calling agent without any human task data? Tool-R0 studies whether self-play between a Generator and a Solver can autonomously create a verifiable curriculum and teach tool use from scratch.

Main Contribution

Algorithm: Tool-R0, a dual-agent self-play RL loop where a Generator and a Solver co-evolve under complementary rewards to synthesize and learn verifiable tool tasks without external data.

Performance: Raises the 1.5B model's average accuracy by 22.99 points across five diverse function-calling benchmarks and beats several supervised baselines while using zero curated data.

Analysis & Tools: Ablations and training-dynamics analyses that quantify the roles of difficulty shaping, parameter separation, curriculum ordering, saturation behavior, and mid-training benefits. The authors also report a modular codebase.

Key Findings

Self-play yields large real gains from zero data on tool-calling.

Numbers: Avg +22.99 pts (24.85 → 47.84); +92.52% rel

Zero-data Tool-R0 matches or beats supervised baselines on average.

Numbers: 47.84 (Tool-R0) vs 46.06 (best supervised ToolRL) avg acc

Separating Generator and Solver parameters is critical for stability.

Numbers: Shared-weights ablation: −17.42 pp (47.84 → 30.42), −36.41% rel

Difficulty-aware curriculum shaping materially improves learning.

Numbers: Removing difficulty reward: −4.30 pp (−8.99% rel)

Active Generator learning matters: static generators hurt performance.

Numbers: Freezing Generator: −6.19 pp
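The relative figures quoted in these findings follow directly from the absolute ones. The short check below uses only numbers reported in this summary; the helper name `rel_change` is ours and purely illustrative.

```python
def rel_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Base 1.5B model -> Tool-R0 (average accuracy, percent)
gain_pts = 47.84 - 24.85
gain_rel = rel_change(24.85, 47.84)

# Full Tool-R0 -> shared-weights ablation
shared_rel = rel_change(47.84, 30.42)

# Full Tool-R0 -> difficulty reward removed (a 4.30 pp drop)
no_diff_rel = rel_change(47.84, 47.84 - 4.30)

print(f"{gain_pts:.2f} pts, {gain_rel:+.2f}% rel")
print(f"shared weights: {shared_rel:+.2f}% rel")
print(f"no difficulty reward: {no_diff_rel:+.2f}% rel")
```

Note that "pp" is an absolute difference in accuracy points, while "rel" divides that difference by the starting accuracy.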

Results

Accuracy

Value: 47.84%

Baseline: 24.85% (Qwen2.5-1.5B base)

Accuracy

Value: 47.84% (Tool-R0)

Baseline: 46.06% (best supervised baseline ToolRL re-trained on same backbone)

Small-model improvement (0.5B)

Value: 30.57%

Baseline: 15.47% (Qwen2.5-0.5B base)

Who Should Care

What To Try In 7 Days

Run a small-scale self-play loop: initialize Generator and Solver from the same base LLM, split parameters, and run 2–3 iterations on a handful of target APIs.

Implement the verification pipeline: require JSON tool menus and gold tool calls to enable execution-based feedback and filter bad samples.
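As a rough illustration of the filtering half of such a pipeline, the sketch below rejects generated samples that are not parseable JSON, or whose gold calls do not match the tool menu. The sample layout (`question`, `tools`, `gold_calls`, and a JSON-Schema-like `parameters` object) is an assumption made for this example, not the format used by the paper.

```python
import json

def verify_sample(raw: str) -> bool:
    """Return True if a generated task is well-formed and self-consistent.

    Assumed (hypothetical) layout:
      {"question": str,
       "tools": [{"name": str, "parameters": {"properties": {...}, "required": [...]}}],
       "gold_calls": [{"name": str, "arguments": {...}}]}
    """
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(sample, dict):
        return False

    question = sample.get("question")
    tools = sample.get("tools")
    gold = sample.get("gold_calls")
    if not isinstance(question, str) or not question.strip():
        return False
    if not isinstance(tools, list) or not isinstance(gold, list) or not tools or not gold:
        return False

    # Build the tool menu: name -> parameter spec
    menu = {}
    for tool in tools:
        if not isinstance(tool, dict) or not isinstance(tool.get("name"), str):
            return False
        params = tool.get("parameters", {})
        if not isinstance(params, dict):
            return False
        menu[tool["name"]] = params

    # Every gold call must name a known tool and respect its argument spec
    for call in gold:
        if not isinstance(call, dict) or not isinstance(call.get("name"), str):
            return False
        if call["name"] not in menu or not isinstance(call.get("arguments"), dict):
            return False
        spec = menu[call["name"]]
        allowed = set(spec.get("properties") or {})
        required = set(spec.get("required") or [])
        given = set(call["arguments"])
        if not required <= given or not given <= allowed:
            return False
    return True

weather_tool = {
    "name": "get_weather",
    "parameters": {"properties": {"city": {}, "unit": {}}, "required": ["city"]},
}
good = json.dumps({
    "question": "What is the weather in Paris?",
    "tools": [weather_tool],
    "gold_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}],
})
bad = json.dumps({
    "question": "What is the weather in Paris?",
    "tools": [weather_tool],
    "gold_calls": [{"name": "get_weather", "arguments": {"unit": "C"}}],
})
print(verify_sample(good), verify_sample(bad))
```

A real pipeline would add execution-based checks on top of this static validation, since a sample can be well-formed and still be unanswerable.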

Adopt a band-pass difficulty signal: estimate solver success via K Monte Carlo rollouts and target tasks with mid-range pass probabilities (e.g., 0.25–0.75).
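One way to picture the band-pass signal: run the Solver K times on a candidate task, take the pass fraction as the difficulty estimate, and reward the Generator most when that fraction lands in the target band. The sketch below assumes a flat-top reward with linear falloff outside the band; that shape, the function names, and the `solve_once` callback are illustrative assumptions, not the paper's exact formulation.

```python
import random
from typing import Callable

def estimate_pass_rate(solve_once: Callable[[], bool], k: int = 8) -> float:
    """Monte Carlo estimate of Solver success: fraction of k rollouts that pass."""
    return sum(bool(solve_once()) for _ in range(k)) / k

def band_pass_reward(p: float, low: float = 0.25, high: float = 0.75) -> float:
    """Assumed shape: 1.0 inside [low, high], falling linearly to 0.0 at p=0 and p=1."""
    if low <= p <= high:
        return 1.0
    if p < low:
        return p / low
    return (1.0 - p) / (1.0 - high)

# Stand-in Solver that passes about half the time (hypothetical)
random.seed(0)
p_hat = estimate_pass_rate(lambda: random.random() < 0.5, k=8)
reward = band_pass_reward(p_hat)
print(p_hat, reward)
```

Tasks the Solver always solves (p = 1) or never solves (p = 0) earn no reward under this shape, which pushes the Generator toward tasks at the edge of the Solver's current ability. With K=8 the estimate is coarse (steps of 0.125), so the band edges should not be read too precisely.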

Agent Features

Memory

  • short-term self-play curriculum (no long-term retrieval)

Planning

  • tool planning
  • multi-step tool composition

Tool Use

  • function calling
  • schema grounding
  • multi-call sequencing

Frameworks

  • GRPO
  • curriculum construction via pass@K

Is Agentic

true

Architectures

  • two roles (Generator & Solver) initialized from one base model, with separate parameters

Collaboration

  • two-role co-evolution (Generator vs Solver)

Optimization Features

Token Efficiency

  • uses K=8 Monte Carlo probes per candidate task for difficulty estimation (extra compute, though K is small)

Infra Optimization

  • three GPUs with gradient accumulation; pragmatic small-batch self-play setup

Model Optimization

  • parameter separation for role stability

System Optimization

  • mixed precision (bfloat16) and DeepSpeed ZeRO-3 used for scaling

Training Optimization

  • GRPO
  • difficulty-conditioned curriculum
  • cross-verification and deduplication
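For readers unfamiliar with GRPO, its core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows that group normalization only; whether Tool-R0 uses exactly this variant (population standard deviation, this `eps`) is not stated in this summary and is an assumption here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one task: two pass (reward 1.0), two fail (reward 0.0)
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# If every rollout gets the same reward, all advantages are zero
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, uniform)
```

The uniform case hints at why a difficulty-conditioned curriculum pairs well with GRPO: tasks that are always solved or never solved produce zero advantages and therefore no learning signal.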

Reproducibility

Code Urls

  • project page referenced in paper (no direct URL provided in text)

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Computational overhead: difficulty estimation queries the Solver multiple times per candidate (K=8), increasing cost.
  • Reward hacking risk: small models can pass verifiable checks while producing low-quality supervision.
  • Early saturation for low-capacity models: improvements often plateau by iteration three.
  • Reported results lack full multi-run statistical error bars in this preprint.

When Not To Use

  • If strict regulatory audit or human-verifiable provenance is required for every training example.
  • When compute budget cannot support repeated Monte Carlo difficulty probes and self-play iterations.
  • If you need deterministic, annotated datasets for traceable evaluation or certification.

Failure Modes

  • Structural errors: wrong tool name, wrong number of calls, missing/extra arguments.
  • Semantic errors: incorrect argument values or missing required keys despite syntactic correctness.
  • Reward-hacking: outputs that satisfy parse/validity checks but provide poor grounding.
  • Curriculum collapse: Generator mode-collapse without grounded task specifications.
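The first two failure modes can be told apart mechanically when a gold call sequence is available. The sketch below is a simplified, order-sensitive comparison; the call layout (`name`, `arguments`) and the function name are assumptions for illustration. It files any key mismatch under "structural", which is a simplification of the categories listed above.

```python
def classify_error(pred: list[dict], gold: list[dict]) -> str:
    """Label a predicted call sequence as 'correct', 'structural', or 'semantic'."""
    if len(pred) != len(gold):
        return "structural"  # wrong number of calls
    semantic = False
    for p, g in zip(pred, gold):
        if p.get("name") != g["name"]:
            return "structural"  # wrong tool name
        p_args = p.get("arguments") or {}
        g_args = g["arguments"]
        if set(p_args) != set(g_args):
            return "structural"  # missing or extra arguments
        if p_args != g_args:
            semantic = True  # same keys, at least one wrong value
    return "semantic" if semantic else "correct"

gold_calls = [{"name": "get_weather", "arguments": {"city": "Paris"}}]

print(classify_error([{"name": "get_weather", "arguments": {"city": "Paris"}}], gold_calls))
print(classify_error([{"name": "get_weather", "arguments": {"city": "Rome"}}], gold_calls))
print(classify_error([{"name": "get_forecast", "arguments": {"city": "Paris"}}], gold_calls))
```

Exact-match comparison like this cannot detect reward hacking or curriculum collapse; those need checks on task quality and diversity rather than on a single prediction.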

Core Entities

Models

  • Qwen2.5-0.5B-Instruct
  • Qwen2.5-1.5B-Instruct
  • Qwen2.5-3B-Instruct
  • Llama-3.2-3B-Instruct

Metrics

  • Accuracy
  • pass@K (for difficulty estimation)

Datasets

  • none (zero-data self-play for training)

Benchmarks

  • Tool-Alpaca
  • Seal-Tools
  • NexusRaven
  • API-Bank
  • SNIPS