ToolTalk: a small automated benchmark for measuring multi-step tool use in dialogs

November 15, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The benchmark is useful and reproducible, but small (78 conversations) and synthetic; evidence comes from clear tables and open code but results are limited to evaluated models and simulated tools.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Nicholas Farn, Richard Shin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to automate user tasks with LLMs, expect frequent multi-step failures and risky incorrect side effects; instrument tool calls and add verification before irreversible actions.

Summary TLDR

ToolTalk is a focused benchmark of 78 multi-turn conversations that test an assistant's ability to call external tools in dialogue. It contains 28 tools (grouped into 7 plugins) with executable simulations and ground-truth tool-call traces. Evaluating OpenAI's function-calling GPT-3.5 and GPT-4 shows high success on easy one-call tasks but low success on hard multi-step conversations (26% for GPT-3.5 and 50% for GPT-4). The main failure modes are premature calls, poor planning, and wrong arguments. The dataset and simulator are public.

Problem Statement

Existing tool-use tests either cover only single-shot API calls or lack action tools and automated checking. We need a conversational, multi-step, automated benchmark that includes tools with side effects so we can measure realistic assistant behavior.

Main Contribution

ToolTalk dataset: 78 multi-turn conversations using 28 tools across 7 plugins with executable simulated tools and ground-truth tool calls.

Evaluation protocol distinguishing action (side-effect) vs non-action tools and measuring recall, precision, incorrect action rate, and conversation-level success.
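The metrics named above can be illustrated with a minimal sketch. It assumes exact matching between predicted and ground-truth calls, and it assumes a conversation counts as a success only when every ground-truth call is recalled and no incorrect action call is made; the summary does not spell out these definitions, so treat them as working assumptions. The `ToolCall` type, the `call` helper, and the tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: tuple  # sorted (key, value) pairs, so calls compare by value
    is_action: bool   # True when the tool has side effects


def call(name, is_action=False, **kwargs):
    """Hypothetical helper that builds a comparable ToolCall."""
    return ToolCall(name, tuple(sorted(kwargs.items())), is_action)


def score_conversation(predicted, ground_truth):
    """Score one conversation using exact-match comparison (an assumption)."""
    truth_left = list(ground_truth)
    matched = 0
    incorrect_actions = 0
    for p in predicted:
        if p in truth_left:
            truth_left.remove(p)  # each ground-truth call matches at most once
            matched += 1
        elif p.is_action:
            incorrect_actions += 1  # unmatched call that has side effects
    n_actions = sum(1 for p in predicted if p.is_action)
    # Empty denominators default to the "nothing went wrong" value in this sketch.
    precision = matched / len(predicted) if predicted else 1.0
    recall = matched / len(ground_truth) if ground_truth else 1.0
    incorrect_action_rate = incorrect_actions / n_actions if n_actions else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "incorrect_action_rate": incorrect_action_rate,
        "success": recall == 1.0 and incorrect_actions == 0,
    }


# Hypothetical two-call conversation: the action call has a wrong argument.
truth = [
    call("SearchCalendar", date="2023-11-15"),
    call("CreateEvent", is_action=True, title="Sync"),
]
pred = [
    call("SearchCalendar", date="2023-11-15"),
    call("CreateEvent", is_action=True, title="Sink"),
]
scores = score_conversation(pred, truth)
```

Separating the incorrect action rate from precision keeps harmless extra lookups distinct from wrong calls that change state, which is the distinction the protocol draws between action and non-action tools.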

Key Findings

Multi-step tool use is still hard: GPT-4 achieves only 50% success on hard conversations.

Numbers: GPT-4 success rate 50% (hard)

Practical Use: Expect half of realistic multi-tool conversations to fail; plan for fallback checks or human review for critical tasks.

Evidence Ref: Table 1

GPT-3.5 performs substantially worse than GPT-4 on hard conversations.

Numbers: GPT-3.5 success rate 26% (hard)

Practical Use: Prefer stronger models (e.g., GPT-4 class) for tool orchestration when possible; otherwise add guardrails.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
success rate | 85.7% | | | GPT-3.5 (easy) | Table 1 reports 85.7% success on easy subset for GPT-3.5 | Table 1
success rate | 92.8% | | | GPT-4 (easy) | Table 1 reports 92.8% success on easy subset for GPT-4 | Table 1

What To Try In 7 Days

Expose concise API docs to the model and re-run key flows to check improvements.

Add simple 'do you want to proceed?' confirmations for action tools to avoid incorrect side effects.

Log predicted tool calls and execution results to detect hallucinated arguments and common failure patterns.
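The confirmation and logging steps above could be combined in one small wrapper. This is a sketch under stated assumptions: `run_tool_call`, the tool registry, the `confirm` callback, and both demo tools are hypothetical names, not part of ToolTalk or any library.

```python
import json
import logging

logger = logging.getLogger("tool_calls")


def run_tool_call(name, arguments, tools, action_tools, confirm):
    """Execute one predicted tool call with logging and a confirmation gate.

    tools: dict mapping tool name to a Python callable (hypothetical registry).
    action_tools: set of tool names that have side effects.
    confirm: callback returning True only if the user approves the action.
    """
    record = {"tool": name, "arguments": arguments}
    if name not in tools:
        record["status"] = "unknown_tool"
    elif name in action_tools and not confirm(name, arguments):
        record["status"] = "declined"
    else:
        try:
            record["result"] = tools[name](**arguments)
            record["status"] = "ok"
        except Exception as exc:  # e.g. TypeError from a hallucinated argument name
            record["status"] = "error"
            record["error"] = repr(exc)
    # One JSON line per call makes failure patterns easy to aggregate later.
    logger.info(json.dumps(record, default=str))
    return record


# Hypothetical demo tools: one read-only lookup, one action with side effects.
demo_tools = {
    "get_weather": lambda city: f"sunny in {city}",
    "send_email": lambda to, body: "sent",
}
demo_actions = {"send_email"}


def always_decline(name, arguments):
    """Stand-in for a real 'do you want to proceed?' prompt."""
    return False


declined = run_tool_call(
    "send_email", {"to": "a@example.com", "body": "hi"},
    demo_tools, demo_actions, always_decline,
)
bad_args = run_tool_call(
    "get_weather", {"town": "Oslo"}, demo_tools, demo_actions, always_decline,
)
```

Here the declined email is never sent, and the call with the misspelled `town` argument is recorded as an error rather than raising, so both cases end up in the log for later review.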

Agent Features

Memory
short-term conversation history
Planning
multi-step tool orchestration, tool selection planning
Tool Use
function calling, action vs non-action tool handling
Frameworks
OpenAI Chat Completions API (function calling)
Is Agentic

Yes
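For readers unfamiliar with function calling, the sketch below shows how a tool description might be prepared for the `functions` list that the Chat Completions API accepted alongside the 0613 models. The `to_function_schema` helper and the `get_weather` tool are hypothetical, and no request is sent; the code only builds and serializes the description.

```python
import json


def to_function_schema(name, description, params, required):
    """Build one entry for a function-calling `functions` list.

    params is a dict of JSON Schema property definitions.
    """
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": params,
            "required": required,
        },
    }


# Hypothetical tool, described concisely so the model sees short API docs.
functions = [
    to_function_schema(
        "get_weather",
        "Look up current weather for a city.",
        {"city": {"type": "string", "description": "City name"}},
        ["city"],
    )
]
payload = json.dumps(functions, indent=2)
```

Keeping descriptions short and listing required parameters explicitly is one way to test whether clearer API docs reduce wrong-argument failures.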

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small dataset: 78 conversations is limited coverage for broad real-world behaviors.

Scenarios were generated with GPT-4 then manually edited; some content is synthetic.

When Not To Use

Evaluating agents that require live web access or real network side effects.

Measuring open-ended retrieval performance or large-scale API coverage.

Failure Modes

Premature tool calls with hallucinated arguments.

Faulty planning and omission of required tools.

Core Entities

Models

gpt-3.5-turbo-0613, gpt-4-0613

Metrics

success rate, precision, recall, incorrect action rate

Datasets

ToolTalk

Benchmarks

ToolTalk

Context Entities

Models

GPT-4 (used to generate scenarios)

Datasets

Prior tool benchmarks (ToolBench, API-Bank, AgentBench) as compared in paper