A stateful, conversational benchmark that tests LLMs using tools in live multi-turn dialogs

August 8, 20247 min

Overview

Decision SnapshotNeeds Validation

ToolSandbox is ready for internal evaluation and research use; expect extra engineering to scale milestone annotation and to make some external-tool-backed scenarios fully reproducible.

Citations2

Evidence Strength0.70

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ToolSandbox tests realistic, multi-turn tool use and highlights where agents hallucinate, fail to sequence dependent actions, or mis-handle time—insights you need before putting LLM agents in customer-facing automation.

Who Should Care

Summary TLDR

ToolSandbox is an open Python-based benchmark that evaluates how well LLMs act as tool-using agents in realistic, multi-turn conversations. It adds three key features missing from many prior suites: stateful tools that change and depend on a mutable world state, an LLM-powered on-policy user simulator for interactive rollouts, and a milestone/minefield evaluation that scores intermediate and final outcomes across any dialog trajectory. The suite contains 1,032 human-authored scenarios, 34 tools, and metrics covering trajectory similarity and turn efficiency. Results show a clear gap between proprietary and open-source models and reveal persistent failure modes like state dependency, time/can

Problem Statement

Existing tool-use benchmarks are often single-turn, stateless, or use fixed off-policy trajectories. Real tasks need stateful tools, on-policy conversations, and flexible evaluation that accepts many valid trajectories. We need a benchmark that measures these aspects and reveals where agents still fail in realistic tool-driven dialogs.

Main Contribution

ToolSandbox: a Python-native, stateful, conversational and interactive benchmark with 1,032 human-authored scenarios.

An on-policy LLM user simulator (with knowledge boundary and demonstrations) for realistic multi-turn evaluation.

Key Findings

Proprietary models outperform open-source models by a large margin on ToolSandbox tasks.

NumbersTop scores: GPT-4o 73.0 vs Hermes 31.4 (Table 5)

Practical UseIf you need reliable tool-use agents today, prefer high-end proprietary models; expect open-source models to need targeted improvements before production.

Evidence RefTable 5

Hard, realistic categories remain challenging: state dependency, canonicalization and insufficient information.

NumbersInsufficient Information score for top model (GPT-4o) is 42.0 (Table 5)

Practical UseDesign focused robustness tests and guardrails for hallucination, time reasoning, and dependent-tool sequences before deploying agents.

Evidence RefTable 5

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Top model average similarityGPT-4o 73.0 (Table 5)ToolSandbox overallTable 5 shows per-model Avg ScoreTable 5
Open-source vs proprietary gapHermes 31.4 vs Claude-3-Haiku 54.9 (Table 5)-23.5 pointsToolSandbox overallTable 5 per-model scoresTable 5

What To Try In 7 Days

Run ToolSandbox on your agent to surface concrete failure cases (state dependency, time/canonicalization, hallucination).

Add milestone-style checks to critical multi-step automations so partial progress is visible and auditable.

Implement a simple knowledge boundary and examples for any simulated user to reduce noisy evaluation.

Agent Features

Memory
short-term world state (Execution Context) trackingmessage-bus conversational history
Planning
tool sequencingerror-retry planningbacktracking across nested dependencies
Tool Use
stateful tool executionfunction calling via JSON-to-Python conversiontool-argument canonicalization testing
Frameworks
Python Execution ContextMessage BusMilestone / Minefield evaluation
Is Agentic

Yes

Architectures
LLM-based function-calling agents
Collaboration
on-policy LLM user simulation

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Milestone and minefield authoring is labor intensive and hard to scale.

User simulator still hallucinates and has instruction-following errors despite improvements.

When Not To Use

If you need fully automated, large-scale benchmark generation today—milestone authoring is manual.

If your tasks require daemon or asynchronous callbacks that interrupt flow.

Failure Modes

Agent issues parallel dependent tool calls, causing race conditions and penalties.

Agent hallucinates tool names or arguments under insufficient information.

Core Entities

Models

GPT-4oClaude-3-OpusGPT-3.5-TurboGPT-4Claude-3-SonnetGemini-1.5-ProGemini-1.0-ProHermes-2-Pro-Mistral-7BMistral-7B-Instruct-v0.3C4AI-Command-R-v01Gorilla-Openfunctions-v2

Metrics

trajectory similarity (milestone/minefield similarity)average similarity scoreaverage turn counttool call AST matchingexecution result exact match

Datasets

ToolSandbox (1032 scenarios)

Benchmarks

BFCLToolEvalAPI-BankToolTalkτ-bench