ToolTalk: a small automated benchmark for measuring multi-step tool use in dialogs

Overview

Decision Snapshot: Needs Validation

The benchmark is useful and reproducible, but small (78 conversations) and synthetic; evidence comes from clear tables and open code but results are limited to evaluated models and simulated tools.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Nicholas Farn, Richard Shin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to automate user tasks with LLMs, expect frequent multi-step failures and risky incorrect side effects; instrument tool calls and add verification before irreversible actions.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

ToolTalk is a focused benchmark of 78 multi-turn conversations that test an assistant's ability to call external tools in dialogue. It contains 28 tools grouped into 7 plugins, with executable simulations and ground-truth tool-call traces. Evaluating GPT-3.5 and GPT-4 through OpenAI's function-calling API shows high success on easy one-call tasks but low success on hard multi-step conversations (GPT-3.5: 26%, GPT-4: 50%). The main failure modes are premature calls, poor planning, and wrong arguments. The dataset and simulator are public.

Problem Statement

Existing tool-use tests either cover only single-shot API calls or lack action tools and automated checking. We need a conversational, multi-step, automated benchmark that includes tools with side effects so we can measure realistic assistant behavior.

Main Contribution

ToolTalk dataset: 78 multi-turn conversations using 28 tools across 7 plugins with executable simulated tools and ground-truth tool calls.

Evaluation protocol distinguishing action (side-effect) vs non-action tools and measuring recall, precision, incorrect action rate, and conversation-level success.
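The four metrics named above can be sketched for a single conversation as follows. This is a minimal illustration, not the benchmark's own scoring code: the `ToolCall` type, the `make_call` and `score_conversation` helpers, the exact-match comparison of calls, and the rule that success means perfect recall with no incorrect action are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool call (not the benchmark's format)."""
    name: str
    arguments: tuple  # sorted (key, value) pairs
    is_action: bool = False  # True when the tool has side effects


def make_call(name, is_action=False, **arguments):
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(name, tuple(sorted(arguments.items())), is_action)


def score_conversation(predicted, ground_truth):
    """Score one conversation. Exact-match comparison is an assumption."""
    remaining = list(ground_truth)
    matched = 0
    incorrect_actions = 0
    for call in predicted:
        if call in remaining:
            remaining.remove(call)  # each reference call can match only once
            matched += 1
        elif call.is_action:
            incorrect_actions += 1  # unmatched call with side effects
    actions = sum(1 for call in predicted if call.is_action)
    recall = matched / len(ground_truth) if ground_truth else 1.0
    return {
        "precision": matched / len(predicted) if predicted else 1.0,
        "recall": recall,
        "incorrect_action_rate": incorrect_actions / actions if actions else 0.0,
        # Assumed success rule: every reference call made, no wrong action.
        "success": recall == 1.0 and incorrect_actions == 0,
    }
```

Under these assumptions, a prediction that performs the right lookup but sends an email to the wrong recipient scores 0.5 precision, 0.5 recall, an incorrect action rate of 1.0, and counts as a failed conversation.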

Key Findings

Multi-step tool use is still hard: GPT-4 achieves only 50% success on hard conversations.

Numbers: GPT-4 success rate 50% (hard)

Practical Use: Expect about half of hard multi-tool conversations to fail; plan for fallback checks or human review for critical tasks.

Evidence Ref: Table 1

GPT-3.5 performs substantially worse than GPT-4 on hard conversations.

Numbers: GPT-3.5 success rate 26% (hard)

Practical Use: Prefer stronger models (e.g., GPT-4 class) for tool orchestration when possible; otherwise add guardrails.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
success rate	85.7%	—	—	GPT-3.5 (easy)	Table 1 reports 85.7% success on easy subset for GPT-3.5	Table 1
success rate	92.8%	—	—	GPT-4 (easy)	Table 1 reports 92.8% success on easy subset for GPT-4	Table 1

What To Try In 7 Days

Expose concise API docs to the model and re-run key flows to check improvements.

Add simple 'do you want to proceed?' confirmations for action tools to avoid incorrect side effects.

Log predicted tool calls and execution results to detect hallucinated arguments and common failure patterns.
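The confirmation and logging suggestions above can be combined in one small wrapper. The sketch below is illustrative only: `run_tool_call`, the `tools` registry, the `action_tools` set, and the `confirm` callback are hypothetical names, and a real deployment would ask the user through its own interface.

```python
import json
import logging

logger = logging.getLogger("tool_calls")


def run_tool_call(name, arguments, tools, action_tools, confirm):
    """Run one predicted tool call, gating action tools behind `confirm`.

    tools: dict mapping tool name to a Python callable (hypothetical registry).
    action_tools: set of tool names that have side effects.
    confirm: callable(name, arguments) -> bool, e.g. a user prompt.
    """
    record = {"tool": name, "arguments": arguments}
    if name not in tools:
        record["status"] = "unknown_tool"
    elif name in action_tools and not confirm(name, arguments):
        record["status"] = "declined"
    else:
        try:
            record["result"] = tools[name](**arguments)
            record["status"] = "ok"
        except Exception as exc:  # wrong or missing arguments surface here
            record["status"] = "error"
            record["error"] = repr(exc)
    # One JSON line per call makes failure patterns easy to aggregate later.
    logger.info(json.dumps(record, default=str))
    return record
```

Passing `confirm=lambda name, args: input(f"Run {name}? [y/N] ") == "y"` gives a console prompt; the logged `status` field then separates declined calls from execution errors when reviewing failures.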

Agent Features

Memory

short-term conversation history

Planning

multi-step tool orchestration, tool selection planning

Tool Use

function calling, action vs non-action tool handling

Frameworks

OpenAI Chat Completions API (function calling)

Is Agentic

Yes

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/microsoft/ToolTalk

Data URLs

https://github.com/microsoft/ToolTalk

Risks & Boundaries

Limitations

Small dataset: 78 conversations is limited coverage for broad real-world behaviors.

Scenarios were generated with GPT-4 then manually edited; some content is synthetic.

When Not To Use

Evaluating agents that require live web access or real network side effects.

Measuring open-ended retrieval performance or large-scale API coverage.

Failure Modes

Premature tool calls with hallucinated arguments.

Faulty planning and omission of required tools.
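One crude way to flag the first failure mode is to check whether string arguments in a predicted call ever appeared in the dialog. The heuristic below is an assumption for illustration, not a method from the paper; `ungrounded_arguments` is a hypothetical helper, and values the model legitimately composes itself (such as an email body) will also be flagged.

```python
def ungrounded_arguments(arguments, conversation_text):
    """Return names of string arguments whose values never appear in the dialog.

    conversation_text should include user turns and earlier tool results,
    since a value copied from a previous tool response is legitimately grounded.
    """
    haystack = conversation_text.lower()
    return sorted(
        key
        for key, value in arguments.items()
        if isinstance(value, str)
        and value.strip()
        and value.lower() not in haystack
    )
```

A non-empty result is a signal to ask the user for the missing value, or to hold an action tool for review, instead of executing the call as predicted.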

Core Entities

Models

gpt-3.5-turbo-0613, gpt-4-0613

Metrics

success rate, precision, recall, incorrect action rate

Datasets

ToolTalk

Benchmarks

ToolTalk

Context Entities

Models

GPT-4 (used to generate scenarios)

Datasets

Prior tool benchmarks (ToolBench, API-Bank, AgentBench) as compared in paper
