A stateful, conversational benchmark that tests LLMs using tools in live multi-turn dialogs

Overview

Decision Snapshot: Needs Validation

ToolSandbox is ready for internal evaluation and research use; expect extra engineering to scale milestone annotation and to make some external-tool-backed scenarios fully reproducible.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ToolSandbox tests realistic, multi-turn tool use and highlights where agents hallucinate, fail to sequence dependent actions, or mishandle time. These are insights you need before putting LLM agents into customer-facing automation.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

ToolSandbox is an open Python-based benchmark that evaluates how well LLMs act as tool-using agents in realistic, multi-turn conversations. It adds three key features missing from many prior suites: stateful tools that change and depend on a mutable world state, an LLM-powered on-policy user simulator for interactive rollouts, and a milestone/minefield evaluation that scores intermediate and final outcomes across any dialog trajectory. The suite contains 1,032 human-authored scenarios, 34 tools, and metrics covering trajectory similarity and turn efficiency. Results show a clear gap between proprietary and open-source models and reveal persistent failure modes like state dependency, time/can

Problem Statement

Existing tool-use benchmarks are often single-turn, stateless, or use fixed off-policy trajectories. Real tasks need stateful tools, on-policy conversations, and flexible evaluation that accepts many valid trajectories. We need a benchmark that measures these aspects and reveals where agents still fail in realistic tool-driven dialogs.
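
To make "stateful tools" concrete, here is a minimal sketch of two tools that share a mutable world state, where one tool only succeeds after the other has changed that state. The class and function names (`WorldState`, `set_cellular`, `send_message`) are hypothetical illustrations, not ToolSandbox's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical mutable world state shared by all tools."""
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular(state: WorldState, enabled: bool) -> str:
    """Tool that mutates the world state."""
    state.cellular_enabled = enabled
    return f"cellular service {'on' if enabled else 'off'}"


def send_message(state: WorldState, to: str, text: str) -> str:
    """Tool whose success depends on state left behind by earlier calls."""
    if not state.cellular_enabled:
        raise RuntimeError("cellular service is off")
    state.sent_messages.append((to, text))
    return f"sent to {to}"


state = WorldState()
try:
    send_message(state, "+15550100", "hi")  # fails: dependency not met yet
except RuntimeError as err:
    first_error = str(err)

set_cellular(state, True)  # the agent repairs the state
result = send_message(state, "+15550100", "hi")  # now succeeds
```

A stateless benchmark would score each call in isolation. With a shared state, the order of calls matters, and the agent has to recover from the first error.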

Main Contribution

ToolSandbox: a Python-native, stateful, conversational and interactive benchmark with 1,032 human-authored scenarios.

An on-policy LLM user simulator (with knowledge boundary and demonstrations) for realistic multi-turn evaluation.
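
One way to read "knowledge boundary and demonstrations" is as a structured system prompt for the simulated user: it states what the user knows, what the user does not know, and gives a few example exchanges. The sketch below builds such a prompt as plain text. The function name, its parameters, and the wording of the prompt are assumptions for illustration, not the benchmark's own prompt.

```python
def build_user_simulator_prompt(goal, known_facts, unknown_facts, demonstrations):
    """Assemble a system prompt for a simulated user (hypothetical format).

    goal:           what the simulated user wants to achieve
    known_facts:    facts the user may reveal when asked
    unknown_facts:  facts the user must admit not knowing
    demonstrations: list of (assistant_turn, user_turn) example pairs
    """
    lines = [
        "You are simulating a user talking to an assistant.",
        f"Your goal: {goal}",
        "You know ONLY the following facts:",
    ]
    lines += [f"- {fact}" for fact in known_facts]
    lines.append("You do NOT know the following; say so if asked:")
    lines += [f"- {fact}" for fact in unknown_facts]
    lines.append("Example exchanges:")
    for assistant_turn, user_turn in demonstrations:
        lines.append(f"Assistant: {assistant_turn}")
        lines.append(f"User: {user_turn}")
    return "\n".join(lines)


prompt = build_user_simulator_prompt(
    goal="Send a message to Alex saying you will be late.",
    known_facts=["The recipient's first name is Alex."],
    unknown_facts=["Alex's phone number."],
    demonstrations=[("What is Alex's number?", "I don't know, can you look it up?")],
)
```

Listing the unknown facts explicitly is meant to discourage the simulator from inventing details, which would make it unclear whether a failure is the agent's or the simulator's.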

Key Findings

Proprietary models outperform open-source models by a large margin on ToolSandbox tasks.

Numbers: Top scores: GPT-4o 73.0 vs Hermes 31.4 (Table 5)

Practical Use: If you need reliable tool-use agents today, prefer high-end proprietary models; expect open-source models to need targeted improvements before production.

Evidence Ref: Table 5

Hard, realistic categories remain challenging: state dependency, canonicalization and insufficient information.

Numbers: Insufficient Information score for top model (GPT-4o) is 42.0 (Table 5)

Practical Use: Design focused robustness tests and guardrails for hallucination, time reasoning, and dependent-tool sequences before deploying agents.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Top model average similarity	GPT-4o 73.0 (Table 5)	—	—	ToolSandbox overall	Table 5 shows per-model Avg Score	Table 5
Open-source vs proprietary gap	Hermes 31.4 vs Claude-3-Haiku 54.9 (Table 5)	—	-23.5 points	ToolSandbox overall	Table 5 per-model scores	Table 5
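
The delta in the second row is the first score minus the second. A short check of that arithmetic, using only the two values quoted in the table (the variable names are illustrative):

```python
# Average scores as quoted in the Results table (attributed there to Table 5).
hermes = 31.4
claude_3_haiku = 54.9

# Round to one decimal to avoid floating-point noise in the difference.
delta = round(hermes - claude_3_haiku, 1)
```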

What To Try In 7 Days

Run ToolSandbox on your agent to surface concrete failure cases (state dependency, time/canonicalization, hallucination).

Add milestone-style checks to critical multi-step automations so partial progress is visible and auditable.

Implement a simple knowledge boundary and examples for any simulated user to reduce noisy evaluation.
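
The milestone-style checks suggested above can be prototyped in a few lines. The sketch below is a deliberate simplification, not the benchmark's scoring procedure. It assumes milestones are an ordered list of predicates over world-state snapshots, that minefields are predicates that must never hold, and that the score is the fraction of milestones reached in order (zero if any minefield is hit). All names are hypothetical.

```python
def score_trajectory(snapshots, milestones, minefields):
    """Fraction of milestones reached in order, or 0.0 if a minefield is hit.

    snapshots:  list of dict world-state snapshots, one per turn
    milestones: ordered list of predicates (snapshot -> bool)
    minefields: list of predicates that must never be true
    """
    if any(mine(snap) for snap in snapshots for mine in minefields):
        return 0.0
    if not milestones:
        return 1.0
    reached = 0
    cursor = 0  # index of the snapshot where the previous milestone matched
    for milestone in milestones:
        for i in range(cursor, len(snapshots)):
            if milestone(snapshots[i]):
                reached += 1
                cursor = i
                break
        else:
            break  # an unmet milestone blocks all later ones
    return reached / len(milestones)


snapshots = [
    {"cellular": False, "sent": 0},
    {"cellular": True, "sent": 0},
    {"cellular": True, "sent": 1},
]
milestones = [lambda s: s["cellular"], lambda s: s["sent"] == 1]
minefields = [lambda s: s["sent"] > 1]  # e.g. a duplicate send

full = score_trajectory(snapshots, milestones, minefields)
partial = score_trajectory(snapshots[:2], milestones, minefields)
blown = score_trajectory(
    snapshots + [{"cellular": True, "sent": 2}], milestones, minefields
)
```

Because the score is computed from state snapshots, not from a fixed script of calls, any trajectory that reaches the milestones in order gets credit, and partial progress shows up as a fractional score.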

Agent Features

Memory

short-term world state (Execution Context) tracking, message-bus conversational history
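
As a rough illustration of message-bus conversational history, the sketch below stores every message with a sender and recipient and lets each role read only the messages it took part in. The `Message` and `MessageBus` classes and the role names are assumptions for this example, not the benchmark's own types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    recipient: str
    content: str


class MessageBus:
    """Hypothetical append-only log of all messages in a dialog."""

    def __init__(self):
        self.history = []

    def post(self, sender, recipient, content):
        self.history.append(Message(sender, recipient, content))

    def visible_to(self, role):
        """Messages a role sent or received, in order."""
        return [m for m in self.history if role in (m.sender, m.recipient)]


bus = MessageBus()
bus.post("user", "agent", "Text Alex that I'm late.")
bus.post("agent", "tools", 'send_message(to="Alex", text="Running late")')
bus.post("tools", "agent", "sent")
bus.post("agent", "user", "Done, I messaged Alex.")

user_view = bus.visible_to("user")
agent_view = bus.visible_to("agent")
```

Here the simulated user never sees raw tool traffic, while the agent sees the whole exchange.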

Planning

tool sequencing, error-retry planning, backtracking across nested dependencies

Tool Use

stateful tool execution, function calling via JSON-to-Python conversion, tool-argument canonicalization testing
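
The following sketch shows how a JSON tool call can be turned into a Python call, and why argument canonicalization matters: the same request fails when the agent passes a phone number in free-form text and succeeds once the number is in canonical form. The tool `search_contacts`, the contact book, and the canonical format are made up for the example.

```python
import json


def search_contacts(phone_number: str) -> list:
    """Hypothetical tool: exact-match lookup on a canonical phone number."""
    book = {"+15550100": "Alex"}
    return [book[phone_number]] if phone_number in book else []


TOOLS = {"search_contacts": search_contacts}


def call_from_json(payload: str):
    """Convert a JSON tool call into a Python function call and run it."""
    call = json.loads(payload)
    return TOOLS[call["name"]](**call["arguments"])


def canonical_phone(raw: str) -> str:
    """Assumed canonical form: '+' followed by digits only."""
    return "+" + "".join(ch for ch in raw if ch.isdigit())


raw_call = '{"name": "search_contacts", "arguments": {"phone_number": "+1 (555) 0100"}}'
canonical_call = json.dumps(
    {
        "name": "search_contacts",
        "arguments": {"phone_number": canonical_phone("+1 (555) 0100")},
    }
)

miss = call_from_json(raw_call)       # free-form argument finds nothing
hit = call_from_json(canonical_call)  # canonical argument finds the contact
```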

Frameworks

Python Execution Context, Message Bus, Milestone / Minefield evaluation

Is Agentic

Yes

Architectures

LLM-based function-calling agents

Collaboration

on-policy LLM user simulation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/apple/ToolSandbox

Data URLs

https://github.com/apple/ToolSandbox

Risks & Boundaries

Limitations

Milestone and minefield authoring is labor intensive and hard to scale.

User simulator still hallucinates and has instruction-following errors despite improvements.

When Not To Use

If you need fully automated, large-scale benchmark generation today (milestone authoring is manual).

If your tasks require daemon or asynchronous callbacks that interrupt flow.

Failure Modes

Agent issues parallel dependent tool calls, causing race conditions and penalties.

Agent hallucinates tool names or arguments under insufficient information.
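
A simple guardrail against the second failure mode is to validate each proposed call against the registered tools before running it. The sketch below uses Python's `inspect.signature` to flag unknown tool names, unknown arguments, and missing required arguments. It assumes tools are plain functions without `*args` or `**kwargs`. The registry and the tool `set_wifi` are hypothetical.

```python
import inspect


def set_wifi(enabled: bool) -> str:
    """Hypothetical tool used only for this example."""
    return "wifi on" if enabled else "wifi off"


TOOLS = {"set_wifi": set_wifi}


def validate_call(tools, name, arguments):
    """Return a list of problems; an empty list means the call is well-formed."""
    if name not in tools:
        return [f"unknown tool: {name}"]
    params = inspect.signature(tools[name]).parameters
    problems = [f"unknown argument: {arg}" for arg in arguments if arg not in params]
    problems += [
        f"missing argument: {pname}"
        for pname, spec in params.items()
        if spec.default is inspect.Parameter.empty and pname not in arguments
    ]
    return problems


bad_name = validate_call(TOOLS, "toggle_wifi", {})
bad_args = validate_call(TOOLS, "set_wifi", {"on": True})
ok = validate_call(TOOLS, "set_wifi", {"enabled": True})
```

Returning the problems as text lets them be fed back to the agent as an error message, so it can retry instead of failing silently.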

Core Entities

Models

GPT-4o, Claude-3-Opus, GPT-3.5-Turbo, GPT-4, Claude-3-Sonnet, Gemini-1.5-Pro, Gemini-1.0-Pro, Hermes-2-Pro-Mistral-7B, Mistral-7B-Instruct-v0.3, C4AI-Command-R-v01, Gorilla-Openfunctions-v2

Metrics

trajectory similarity (milestone/minefield similarity), average similarity score, average turn count, tool call AST matching, execution result exact match
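
To illustrate tool call AST matching in general terms, the sketch below parses two call strings with Python's standard `ast` module and compares function name and keyword arguments, so that argument order and quoting style do not matter. It is a generic illustration that handles only simple keyword-argument calls with literal values. It is not the matching code of any particular benchmark, and the function names are hypothetical.

```python
import ast


def normalize_call(source: str):
    """Parse one call expression into (function name, keyword-argument dict)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a simple function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


def calls_match(predicted: str, reference: str) -> bool:
    """True when both strings denote the same call with the same arguments."""
    return normalize_call(predicted) == normalize_call(reference)


same = calls_match('send(to="a", text="hi")', "send(text='hi', to='a')")
different = calls_match('send(to="a", text="hi")', 'send(to="b", text="hi")')
```

Matching on the parsed structure is stricter than comparing outcomes: two different calls can leave the world in the same state, which is one reason a state-based score can accept trajectories that AST matching would reject.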

Datasets

ToolSandbox (1,032 scenarios)

Benchmarks

BFCL, ToolEval, API-Bank, ToolTalk, τ-bench
