Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
ToolSandbox tests realistic, multi-turn tool use and highlights where agents hallucinate, fail to sequence dependent actions, or mishandle time. You need these insights before putting LLM agents in customer-facing automation.
Summary TLDR
ToolSandbox is an open Python-based benchmark that evaluates how well LLMs act as tool-using agents in realistic, multi-turn conversations. It adds three key features missing from many prior suites: stateful tools that change and depend on a mutable world state, an LLM-powered on-policy user simulator for interactive rollouts, and a milestone/minefield evaluation that scores intermediate and final outcomes across any dialog trajectory. The suite contains 1,032 human-authored scenarios, 34 tools, and metrics covering trajectory similarity and turn efficiency. Results show a clear gap between proprietary and open-source models and reveal persistent failure modes such as state dependency and time/canonicalization.
Problem Statement
Existing tool-use benchmarks are often single-turn, stateless, or use fixed off-policy trajectories. Real tasks need stateful tools, on-policy conversations, and flexible evaluation that accepts many valid trajectories. We need a benchmark that measures these aspects and reveals where agents still fail in realistic tool-driven dialogs.
Main Contribution
ToolSandbox: a Python-native, stateful, conversational and interactive benchmark with 1,032 human-authored scenarios.
An on-policy LLM user simulator (with knowledge boundary and demonstrations) for realistic multi-turn evaluation.
A milestone and minefield scoring system that evaluates intermediate and final execution across arbitrary dialog trajectories.
A toolbox of 34 composable tools across 11 domains, with tool-augmentation variants for robustness testing.
Empirical evaluation showing large gaps between proprietary and open-source models and exposing concrete failure modes.
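To make "stateful tools" concrete, here is a minimal sketch of two tools that share a mutable world state, where one tool only succeeds after the other has changed that state. The names `WorldState`, `set_cellular`, and `send_message` are hypothetical and chosen for illustration; they are not taken from the ToolSandbox codebase.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical mutable world state shared by all tools."""
    cellular_on: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular(state: WorldState, on: bool) -> str:
    """Tool that mutates the world state."""
    state.cellular_on = on
    return f"cellular service {'enabled' if on else 'disabled'}"


def send_message(state: WorldState, phone: str, text: str) -> str:
    """Tool whose success depends on state set by another tool."""
    if not state.cellular_on:
        raise RuntimeError("cellular service is off")
    state.sent_messages.append((phone, text))
    return "message sent"


state = WorldState()

# First attempt fails because the prerequisite state is not in place.
try:
    send_message(state, "+15550100", "hi")
    first_attempt = "ok"
except RuntimeError as err:
    first_attempt = f"error: {err}"

# The agent must discover the dependency, fix the state, then retry.
set_cellular(state, True)
second_attempt = send_message(state, "+15550100", "hi")
```

An agent that calls `send_message` without first enabling cellular service hits the error path, which is the kind of state dependency the benchmark is described as probing.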
Key Findings
Proprietary models outperform open-source models by a large margin on ToolSandbox tasks.
Hard, realistic categories remain challenging: state dependency, canonicalization and insufficient information.
On-policy simulated user prompts with knowledge boundaries and demonstrations cut user-simulator errors roughly in half.
ToolSandbox yields deeper, longer interactions than prior benchmarks.
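As an illustration of a knowledge boundary plus demonstrations, the sketch below assembles a system prompt for a simulated user. The prompt wording, the function name `build_user_simulator_prompt`, and its parameters are assumptions for illustration only, not the prompts used in ToolSandbox.

```python
from typing import List, Tuple


def build_user_simulator_prompt(
    goal: str,
    known_facts: List[str],
    unknown_topics: List[str],
    demonstrations: List[Tuple[str, str]],
) -> str:
    """Assemble a system prompt for a simulated user (hypothetical format)."""
    lines = [
        "You are role-playing a user talking to an assistant.",
        f"Your goal: {goal}",
        "",
        "Knowledge boundary. You know only these facts:",
    ]
    lines += [f"- {fact}" for fact in known_facts]
    lines.append("You do NOT know, and must not invent:")
    lines += [f"- {topic}" for topic in unknown_topics]
    lines.append("")
    lines.append("Example exchanges:")
    for assistant_turn, user_reply in demonstrations:
        lines.append(f"Assistant: {assistant_turn}")
        lines.append(f"User: {user_reply}")
    return "\n".join(lines)


prompt = build_user_simulator_prompt(
    goal="Send a message to Fredrik saying you will be late.",
    known_facts=["The recipient's first name is Fredrik."],
    unknown_topics=["Fredrik's phone number"],
    demonstrations=[
        ("What is Fredrik's phone number?",
         "I don't know it, can you look it up?"),
    ],
)
```

Listing what the simulated user does not know is meant to discourage it from inventing details such as a phone number, and the example exchange shows the desired refusal pattern.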
Results
Top model average similarity
Open-source vs proprietary gap
Dataset size and dialog complexity
Who Should Care
What To Try In 7 Days
Run ToolSandbox on your agent to surface concrete failure cases (state dependency, time/canonicalization, hallucination).
Add milestone-style checks to critical multi-step automations so partial progress is visible and auditable.
Implement a simple knowledge boundary and examples for any simulated user to reduce noisy evaluation.
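A milestone-style check for your own automation could look like the sketch below. It is a deliberately simplified stand-in, not ToolSandbox's scoring: milestones are ordered predicates over state snapshots, minefields are predicates that must never hold, and the score is the fraction of milestones reached in order (zero if any minefield fires). All names are hypothetical.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, object]
Check = Callable[[Snapshot], bool]


def score_trajectory(
    snapshots: List[Snapshot],
    milestones: List[Check],
    minefields: List[Check],
) -> float:
    """Return 0.0 if a minefield is hit, else the fraction of milestones
    satisfied in order along the trajectory."""
    if any(mine(snap) for snap in snapshots for mine in minefields):
        return 0.0
    if not milestones:
        return 1.0
    matched = 0
    for snap in snapshots:
        # A single snapshot may satisfy several consecutive milestones.
        while matched < len(milestones) and milestones[matched](snap):
            matched += 1
    return matched / len(milestones)


milestones = [
    lambda s: s["cellular_on"] is True,   # prerequisite state reached
    lambda s: s["sent"] >= 1,             # message actually sent
]
minefields = [lambda s: s["sent"] > 1]    # duplicate sends are forbidden

trajectory = [
    {"cellular_on": False, "sent": 0},
    {"cellular_on": True, "sent": 0},
    {"cellular_on": True, "sent": 1},
]

full_score = score_trajectory(trajectory, milestones, minefields)
partial_score = score_trajectory(trajectory[:2], milestones, minefields)
```

Because the score is computed over snapshots rather than a fixed script, any trajectory that reaches the milestones in order gets credit, and partial progress shows up as a fractional score.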
Agent Features
Memory
- short-term world state (Execution Context) tracking
- message-bus conversational history
Planning
- tool sequencing
- error-retry planning
- backtracking across nested dependencies
Tool Use
- stateful tool execution
- function calling via JSON-to-Python conversion
- tool-argument canonicalization testing
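The JSON-to-Python step can be sketched as a small dispatcher that parses a JSON tool call, rejects tool names that do not exist, validates the arguments against the Python signature, and then calls the function. The JSON shape (`name`, `arguments`), the `TOOLS` registry, and `add_reminder` are assumptions for illustration, not the benchmark's actual interface.

```python
import inspect
import json


def add_reminder(title: str, timestamp: float) -> dict:
    """Hypothetical tool used only for this example."""
    return {"title": title, "timestamp": timestamp}


# Registry mapping tool names to Python callables.
TOOLS = {"add_reminder": add_reminder}


def dispatch(raw_call: str):
    """Convert a JSON tool call into a Python function call."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        # Catches hallucinated tool names before anything executes.
        raise KeyError(f"unknown tool: {name}")
    func = TOOLS[name]
    # bind() raises TypeError on missing or unexpected arguments.
    bound = inspect.signature(func).bind(**call.get("arguments", {}))
    return func(*bound.args, **bound.kwargs)


result = dispatch(
    '{"name": "add_reminder", '
    '"arguments": {"title": "call mom", "timestamp": 0.0}}'
)
```

Validating against the signature before execution turns hallucinated arguments into an explicit error the agent can see and recover from.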
Frameworks
- Python Execution Context
- Message Bus
- Milestone / Minefield evaluation
Is Agentic
true
Architectures
- LLM-based function-calling agents
Collaboration
- on-policy LLM user simulation
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Milestone and minefield authoring is labor intensive and hard to scale.
- User simulator still hallucinates and has instruction-following errors despite improvements.
- Some tools rely on external web services, reducing reproducibility.
- Daemon-style tools (future asynchronous interrupts) are not supported.
When Not To Use
- If you need fully automated, large-scale benchmark generation today, because milestone authoring is manual.
- If your tasks require daemon or asynchronous callbacks that interrupt flow.
- If you require fully offline reproducibility for all external-tool queries without caching.
Failure Modes
- Agent issues parallel dependent tool calls, causing race conditions and penalties.
- Agent hallucinates tool names or arguments under insufficient information.
- Agent mis-canonicalizes dates/times and hallucinates timestamps.
- User simulator hallucination can still inject noise into on-policy evaluation.
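One way to reduce the date/time failure mode in your own tools is to canonicalize timestamps in code instead of letting the model produce them. The sketch below assumes an input format of `YYYY-MM-DD HH:MM` interpreted as UTC; both the format and the helper name `to_utc_timestamp` are illustrative choices, not part of ToolSandbox.

```python
from datetime import datetime, timezone


def to_utc_timestamp(text: str) -> float:
    """Parse 'YYYY-MM-DD HH:MM' as UTC and return a POSIX timestamp.

    Raises ValueError for any other format, so free-form phrases such as
    'tomorrow at noon' are rejected instead of silently guessed.
    """
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


one_day_after_epoch = to_utc_timestamp("1970-01-02 00:00")
```

Attaching an explicit time zone before calling `timestamp()` avoids depending on the machine's local time zone, which would otherwise make results differ between environments.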
Core Entities
Models
- GPT-4o
- Claude-3-Opus
- GPT-3.5-Turbo
- GPT-4
- Claude-3-Sonnet
- Gemini-1.5-Pro
- Gemini-1.0-Pro
- Hermes-2-Pro-Mistral-7B
- Mistral-7B-Instruct-v0.3
- C4AI-Command-R-v01
- Gorilla-Openfunctions-v2
Metrics
- trajectory similarity (milestone/minefield similarity)
- average similarity score
- average turn count
- tool call AST matching
- execution result exact match
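As one plausible reading of "tool call AST matching", the sketch below parses two call expressions with Python's `ast` module and compares function name and keyword arguments, ignoring argument order. It handles only simple calls with literal keyword arguments, and the helper names are hypothetical; the listed benchmarks may define the metric differently.

```python
import ast


def parse_call(source: str):
    """Return (function_name, kwargs) for a simple call expression."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a simple function call")
    if node.args:
        raise ValueError("positional arguments are not supported in this sketch")
    # literal_eval accepts AST nodes and only evaluates literals.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


def calls_match(predicted: str, expected: str) -> bool:
    """True when both calls name the same function with equal arguments."""
    return parse_call(predicted) == parse_call(expected)


same = calls_match(
    'send(phone="123", text="hi")',
    'send(text="hi", phone="123")',
)
different = calls_match('send(phone="123")', 'send(phone="124")')
```

Comparing parsed structure rather than raw strings means differences in whitespace, quoting style, or keyword order do not count as mismatches.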
Datasets
- ToolSandbox (1,032 scenarios)
Benchmarks
- BFCL
- ToolEval
- API-Bank
- ToolTalk
- τ-bench

