A stateful, conversational benchmark that tests LLMs using tools in live multi-turn dialogs

August 8, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

2

Authors

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang

Why It Matters For Business

ToolSandbox tests realistic, multi-turn tool use and highlights where agents hallucinate, fail to sequence dependent actions, or mishandle time. These are insights you need before putting LLM agents in customer-facing automation.

Summary TLDR

ToolSandbox is an open Python-based benchmark that evaluates how well LLMs act as tool-using agents in realistic, multi-turn conversations. It adds three key features missing from many prior suites: stateful tools that change and depend on a mutable world state, an LLM-powered on-policy user simulator for interactive rollouts, and a milestone/minefield evaluation that scores intermediate and final outcomes across any dialog trajectory. The suite contains 1,032 human-authored scenarios, 34 tools, and metrics covering trajectory similarity and turn efficiency. Results show a clear gap between proprietary and open-source models and reveal persistent failure modes like state dependency, time/can

Problem Statement

Existing tool-use benchmarks are often single-turn, stateless, or use fixed off-policy trajectories. Real tasks need stateful tools, on-policy conversations, and flexible evaluation that accepts many valid trajectories. We need a benchmark that measures these aspects and reveals where agents still fail in realistic tool-driven dialogs.

Main Contribution

ToolSandbox: a Python-native, stateful, conversational and interactive benchmark with 1,032 human-authored scenarios.

An on-policy LLM user simulator (with knowledge boundary and demonstrations) for realistic multi-turn evaluation.

A milestone and minefield scoring system that evaluates intermediate and final execution across arbitrary dialog trajectories.

A toolbox of 34 composable tools across 11 domains, with tool-augmentation variants for robustness testing.

Empirical evaluation showing large gaps between proprietary and open-source models and exposing concrete failure modes.
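The milestone and minefield idea can be illustrated with a small sketch. This is not ToolSandbox's actual scoring code: it assumes milestones are ordered predicates over world-state snapshots, that the score is the fraction of milestones reached in order, and that hitting any minefield zeroes the score. The names `score_trajectory`, `cellular_on`, and `messages_sent` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a snapshot of the world state and a predicate over it.
State = Dict[str, object]
Check = Callable[[State], bool]


def score_trajectory(
    snapshots: Sequence[State],
    milestones: List[Check],
    minefields: List[Check],
) -> float:
    """Fraction of milestones reached in order; 0.0 if any minefield fires.

    Simplifying assumption: milestones form a strict sequence, not a graph.
    """
    next_milestone = 0
    for state in snapshots:
        # A forbidden event anywhere in the trajectory voids the whole run.
        if any(mine(state) for mine in minefields):
            return 0.0
        # One snapshot may satisfy several consecutive milestones.
        while next_milestone < len(milestones) and milestones[next_milestone](state):
            next_milestone += 1
    if not milestones:
        return 1.0
    return next_milestone / len(milestones)


# Toy scenario: turn cellular service on, then send exactly one message.
milestones = [lambda s: s["cellular_on"], lambda s: s["messages_sent"] == 1]
minefields = [lambda s: s["messages_sent"] > 1]

full_run = [
    {"cellular_on": False, "messages_sent": 0},
    {"cellular_on": True, "messages_sent": 0},
    {"cellular_on": True, "messages_sent": 1},
]
partial_run = full_run[:2]
bad_run = full_run + [{"cellular_on": True, "messages_sent": 2}]

full_score = score_trajectory(full_run, milestones, minefields)
partial_score = score_trajectory(partial_run, milestones, minefields)
bad_score = score_trajectory(bad_run, milestones, minefields)
```

Because the checks look only at state snapshots, any dialog path that reaches the same states in the same order gets the same score, which is the property that lets one rubric cover many valid trajectories.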

Key Findings

Proprietary models outperform open-source models by a large margin on ToolSandbox tasks.

Numbers: Top scores: GPT-4o 73.0 vs Hermes 31.4 (Table 5)

Hard, realistic categories remain challenging: state dependency, canonicalization and insufficient information.

Numbers: Insufficient Information score for top model (GPT-4o) is 42.0 (Table 5)

On-policy simulated-user prompts with knowledge boundaries and demonstrations cut user-simulator hallucination errors roughly in half and instruction-following errors by about 88%.

Numbers: Hallucination errors: 12.4% → 6.97%; instruction-following (IF) errors: 6.2% → 0.77% (Table 2)

ToolSandbox yields deeper, longer interactions than prior benchmarks.

Numbers: Avg turns 13.9 and 3.8 tool calls per dialog across 1,032 cases (Table 4)

Results

Top model average similarity

Value: GPT-4o 73.0 (Table 5)

Open-source vs proprietary gap

Value: Hermes 31.4 vs Claude-3-Haiku 54.9 (Table 5)

Dataset size and dialog complexity

Value: 1,032 scenarios; avg turns 13.9; avg tool calls 3.8

Who Should Care

What To Try In 7 Days

Run ToolSandbox on your agent to surface concrete failure cases (state dependency, time/canonicalization, hallucination).

Add milestone-style checks to critical multi-step automations so partial progress is visible and auditable.

Implement a simple knowledge boundary and examples for any simulated user to reduce noisy evaluation.
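For the simulated-user step, one possible starting point is a prompt builder that states explicitly what the simulated user knows, what it does not know, and a few example exchanges. This is a minimal sketch and not the prompt used by ToolSandbox; the function name, the wording of the prompt, and the sample facts are all assumptions for illustration.

```python
from typing import List, Tuple


def build_user_simulator_prompt(
    goal: str,
    known_facts: List[str],
    unknown_topics: List[str],
    demonstrations: List[Tuple[str, str]],
) -> str:
    """Assemble a system prompt with an explicit knowledge boundary.

    `demonstrations` holds (assistant message, expected user reply) pairs.
    """
    lines = [
        "You are role-playing a user talking to an assistant.",
        f"Your goal: {goal}",
        "",
        "You know ONLY the following facts:",
    ]
    lines += [f"- {fact}" for fact in known_facts]
    lines += ["", "You do NOT know the following. If asked, say you do not know:"]
    lines += [f"- {topic}" for topic in unknown_topics]
    lines += ["", "Example exchanges:"]
    for assistant_msg, user_reply in demonstrations:
        lines += [f"Assistant: {assistant_msg}", f"User: {user_reply}"]
    return "\n".join(lines)


prompt = build_user_simulator_prompt(
    goal="Send a text to Alex saying you will be late.",
    known_facts=["The recipient is called Alex."],
    unknown_topics=["Alex's phone number"],
    demonstrations=[
        ("What is Alex's phone number?", "I don't know, can you look it up?"),
    ],
)
```

Listing the unknowns explicitly gives the simulator a sanctioned answer ("I don't know") for questions it would otherwise be tempted to answer with invented details.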

Agent Features

Memory

  • short-term world state (Execution Context) tracking
  • message-bus conversational history

Planning

  • tool sequencing
  • error-retry planning
  • backtracking across nested dependencies

Tool Use

  • stateful tool execution
  • function calling via JSON-to-Python conversion
  • tool-argument canonicalization testing
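The tool-use features above can be pictured with a small sketch: a JSON tool call is converted into a Python function call, and the tools read and mutate a shared world state so that one call can depend on an earlier one. The tool names, state keys, and JSON shape here are hypothetical and are not taken from the ToolSandbox codebase.

```python
import json
from typing import Any, Callable, Dict

WorldState = Dict[str, Any]


def set_cellular(state: WorldState, enabled: bool) -> str:
    """Mutates the world state; later tools depend on this flag."""
    state["cellular_on"] = enabled
    return f"cellular_on={enabled}"


def send_message(state: WorldState, to: str, body: str) -> str:
    """Fails unless an earlier call has turned cellular service on."""
    if not state["cellular_on"]:
        raise RuntimeError("cellular service is off")
    state["sent"].append({"to": to, "body": body})
    return "message sent"


# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., str]] = {
    "set_cellular": set_cellular,
    "send_message": send_message,
}


def execute_tool_call(state: WorldState, raw_call: str) -> str:
    """Convert a JSON tool call into a Python call against the shared state."""
    call = json.loads(raw_call)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return f"error: unknown tool {call['name']!r}"
    try:
        return tool(state, **call.get("arguments", {}))
    except (RuntimeError, TypeError) as exc:
        # Errors are returned as text so the agent can read them and retry.
        return f"error: {exc}"
```

In this sketch, calling `send_message` before `set_cellular` returns an error string rather than raising, which mirrors the idea that an agent must notice the failed precondition, fix the state, and retry.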

Frameworks

  • Python Execution Context
  • Message Bus
  • Milestone / Minefield evaluation

Is Agentic

true

Architectures

  • LLM-based function-calling agents

Collaboration

  • on-policy LLM user simulation

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Milestone and minefield authoring is labor intensive and hard to scale.
  • User simulator still hallucinates and has instruction-following errors despite improvements.
  • Some tools rely on external web services, reducing reproducibility.
  • Daemon-style tools (future asynchronous interrupts) are not supported.

When Not To Use

  • If you need fully automated, large-scale benchmark generation today (milestone authoring is manual).
  • If your tasks require daemon or asynchronous callbacks that interrupt flow.
  • If you require fully offline reproducibility for all external-tool queries without caching.

Failure Modes

  • Agent issues dependent tool calls in parallel, causing race conditions and score penalties.
  • Agent hallucinates tool names or arguments under insufficient information.
  • Agent mis-canonicalizes dates/times and hallucinates timestamps.
  • User simulator hallucination can still inject noise into on-policy evaluation.
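The first failure mode, dependent tool calls issued in the same turn, can be caught on the agent side with a simple guard. The sketch below is an illustration only and is not part of ToolSandbox: it assumes you maintain a hypothetical `requires` map from each tool to the tools that must already have completed, and it flags any call in a batch whose prerequisites have not finished.

```python
from typing import Dict, List, Sequence, Set, Tuple


def find_dependency_violations(
    batch: Sequence[str],
    completed: Sequence[str],
    requires: Dict[str, Set[str]],
) -> List[Tuple[str, List[str]]]:
    """Return (tool, missing prerequisites) for each unsafe call in a batch.

    Calls in one batch are treated as parallel, so a prerequisite in the
    same batch does not count as completed.
    """
    done = set(completed)
    violations = []
    for name in batch:
        missing = requires.get(name, set()) - done
        if missing:
            violations.append((name, sorted(missing)))
    return violations


# Hypothetical dependency map: sending a message needs cellular service on.
requires = {"send_message": {"set_cellular"}}

parallel = find_dependency_violations(
    batch=["set_cellular", "send_message"], completed=[], requires=requires
)
sequential = find_dependency_violations(
    batch=["send_message"], completed=["set_cellular"], requires=requires
)
```

Here `parallel` reports `send_message` as unsafe, while `sequential` is empty because the prerequisite finished in an earlier turn.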

Core Entities

Models

  • GPT-4o
  • Claude-3-Opus
  • GPT-3.5-Turbo
  • GPT-4
  • Claude-3-Sonnet
  • Gemini-1.5-Pro
  • Gemini-1.0-Pro
  • Hermes-2-Pro-Mistral-7B
  • Mistral-7B-Instruct-v0.3
  • C4AI-Command-R-v01
  • Gorilla-Openfunctions-v2

Metrics

  • trajectory similarity (milestone/minefield similarity)
  • average similarity score
  • average turn count
  • tool call AST matching
  • execution result exact match
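To make the last two metrics concrete, here is a minimal sketch of AST-based tool call matching and exact matching of execution results, using Python's standard `ast` module. It assumes tool calls are written as Python call expressions with literal arguments; ToolSandbox's own matching logic may differ, and the helper names are hypothetical.

```python
import ast
from typing import Any, Tuple


def normalize_call(source: str) -> Tuple[str, tuple, tuple]:
    """Parse one call expression into (name, positional args, sorted kwargs)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if any(kw.arg is None for kw in node.keywords):
        raise ValueError("**kwargs unpacking is not supported in this sketch")
    name = ast.unparse(node.func)  # requires Python 3.9+
    args = tuple(ast.literal_eval(arg) for arg in node.args)
    kwargs = tuple(
        sorted((kw.arg, ast.literal_eval(kw.value)) for kw in node.keywords)
    )
    return name, args, kwargs


def calls_match(predicted: str, expected: str) -> bool:
    """Structural match: ignores whitespace, quoting style, and kwarg order."""
    return normalize_call(predicted) == normalize_call(expected)


def results_match(predicted: Any, expected: Any) -> bool:
    """Exact match on the value returned by executing the tool."""
    return predicted == expected


same = calls_match(
    "send_message(to='+1555', body='hi')",
    'send_message( body="hi", to="+1555" )',
)
different = calls_match(
    "send_message(to='+1555', body='hi')",
    "send_message(to='+1556', body='hi')",
)
```

Comparing parsed structure rather than raw strings avoids penalizing an agent for harmless differences such as argument order, while still catching a wrong argument value.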

Datasets

  • ToolSandbox (1,032 scenarios)

Benchmarks

  • BFCL
  • ToolEval
  • API-Bank
  • ToolTalk
  • τ-bench