ToolTalk: a small automated benchmark for measuring multi-step tool use in dialogs

November 15, 2023 (6 min)

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Nicholas Farn, Richard Shin

Why It Matters For Business

If you plan to automate user tasks with LLMs, expect frequent failures on multi-step tasks and incorrect actions with real side effects. Instrument tool calls and add verification before irreversible actions.

Summary TLDR

ToolTalk is a focused benchmark of 78 multi-turn conversations that test an assistant's ability to call external tools in dialogue. It contains 28 tools (7 plugins) with executable simulations and ground-truth tool-call traces. Evaluating OpenAI's function-calling GPT-3.5 and GPT-4 shows high success on easy one-call tasks but low success on multi-step conversations (GPT-3.5: 26% hard, GPT-4: 50% hard). Main failure modes are premature calls, poor planning, and wrong arguments. The dataset and simulator are public.

Problem Statement

Existing tool-use tests either cover only single-shot API calls or lack action tools and automated checking. We need a conversational, multi-step, automated benchmark that includes tools with side effects so we can measure realistic assistant behavior.

Main Contribution

ToolTalk dataset: 78 multi-turn conversations using 28 tools across 7 plugins with executable simulated tools and ground-truth tool calls.

Evaluation protocol distinguishing action (side-effect) vs non-action tools and measuring recall, precision, incorrect action rate, and conversation-level success.

Empirical evaluation of function-calling GPT-3.5 and GPT-4 and error analysis identifying three main failure modes and the impact of tool documentation.
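The four metrics named above can be sketched in a few lines of Python. This is a minimal illustration under assumed definitions, not the paper's evaluation code: a predicted call counts as correct only if it exactly matches an unused ground-truth call, the incorrect action rate is taken as unmatched action calls divided by all predicted calls, and a conversation succeeds when recall is perfect and no incorrect action occurred. The names `ToolCall`, `make_call`, `score_conversation`, and the tool names in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple           # sorted (key, value) pairs, so calls compare by value
    is_action: bool = False  # True for tools with side effects


def make_call(name, is_action=False, **args):
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(name, tuple(sorted(args.items())), is_action)


def score_conversation(predicted, ground_truth):
    """Score one conversation under the assumed definitions described above."""
    remaining = list(ground_truth)
    matched = 0
    incorrect_actions = 0
    for call in predicted:
        if call in remaining:
            remaining.remove(call)  # each reference call can be matched once
            matched += 1
        elif call.is_action:
            incorrect_actions += 1  # unmatched call with a side effect
    precision = matched / len(predicted) if predicted else 1.0
    recall = matched / len(ground_truth) if ground_truth else 1.0
    rate = incorrect_actions / len(predicted) if predicted else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "incorrect_action_rate": rate,
        "success": recall == 1.0 and incorrect_actions == 0,
    }


# Hypothetical example: the search call is right, the action call has a wrong argument.
truth = [
    make_call("search_events", query="standup"),
    make_call("create_event", is_action=True, title="Standup"),
]
pred = [
    make_call("search_events", query="standup"),
    make_call("create_event", is_action=True, title="Sync"),
]
scores = score_conversation(pred, truth)
```

Exact matching is the simplest choice here; a real evaluator may need looser comparison for free-text arguments.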

Key Findings

Multi-step tool use is still hard: GPT-4 achieves only 50% success on hard conversations.

Numbers: GPT-4 success rate 50% (hard)

GPT-3.5 performs substantially worse than GPT-4 on hard conversations.

Numbers: GPT-3.5 success rate 26% (hard)

Tool documentation materially improves performance.

Numbers: Hard success drops: GPT-4 50% → 34% without docs

Primary failure modes are faulty planning, premature tool calls, and wrong arguments.

Numbers: GPT-4 failing-turn split: 42% faulty planning, 32% premature, 26% incorrect args

Easy single-call tasks are usually solved by both models.

Numbers: Easy success: GPT-3.5 85.7%, GPT-4 92.8%
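To check the documentation finding above on your own flows, one option is to run the same conversations with and without tool descriptions. The sketch below strips every `description` field from a JSON-schema-style tool spec of the kind used for function calling. It is an assumption about how such an ablation could be set up, not the paper's procedure, and the `create_event` spec is a hypothetical example.

```python
def strip_descriptions(node, in_properties=False):
    """Return a copy of a tool spec with documentation strings removed.

    Keys directly under "properties" are parameter names, so a parameter
    that happens to be called "description" is kept.
    """
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "description" and not in_properties:
                continue  # drop the documentation string
            child_is_properties = key == "properties" and not in_properties
            out[key] = strip_descriptions(value, child_is_properties)
        return out
    if isinstance(node, list):
        return [strip_descriptions(item) for item in node]
    return node


# Hypothetical tool spec, for illustration only.
create_event_spec = {
    "name": "create_event",
    "description": "Create a calendar event for the current user.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Event title."},
            "description": {"type": "string", "description": "Free-text notes."},
        },
        "required": ["title"],
    },
}

undocumented_spec = strip_descriptions(create_event_spec)
```

Running key flows against both `create_event_spec` and `undocumented_spec` gives a rough local measure of how much your model depends on the docs.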

Results

success rate (GPT-3.5, easy): 85.7%

success rate (GPT-4, easy): 92.8%

success rate (GPT-3.5, hard): 26.0%

success rate (GPT-4, hard): 50.0%

precision: 74.9%

incorrect action rate: 25.1%

Who Should Care

What To Try In 7 Days

Expose concise API docs to the model and re-run key flows to check improvements.

Add simple 'do you want to proceed?' confirmations for action tools to avoid unintended side effects.

Log predicted tool calls and execution results to detect hallucinated arguments and common failure patterns.
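The confirmation and logging steps above can be combined in one thin wrapper around tool execution. This is a minimal sketch, assuming tools are plain Python callables and that a `confirm` callback stands in for whatever prompt your product shows the user; `ToolRunner` and the `send_email` example are hypothetical names.

```python
import json
import logging

logger = logging.getLogger("tool_calls")


class ToolRunner:
    """Logs every predicted call and asks for confirmation before action tools."""

    def __init__(self, tools, action_tools, confirm):
        self.tools = tools                      # name -> callable
        self.action_tools = set(action_tools)   # names of tools with side effects
        self.confirm = confirm                  # callable(name, args) -> bool

    def run(self, name, args):
        logger.info("predicted call: %s %s", name,
                    json.dumps(args, sort_keys=True, default=str))
        if name not in self.tools:
            logger.warning("unknown tool: %s", name)
            return {"status": "error", "error": f"unknown tool: {name}"}
        if name in self.action_tools and not self.confirm(name, args):
            logger.info("declined by user: %s", name)
            return {"status": "declined"}
        try:
            result = self.tools[name](**args)
        except Exception as exc:  # log the failure and report it to the caller
            logger.exception("tool failed: %s", name)
            return {"status": "error", "error": str(exc)}
        logger.info("result: %s", json.dumps(result, default=str))
        return {"status": "ok", "result": result}


# Hypothetical usage with an in-memory stand-in for a real email tool.
outbox = []


def send_email(to, body):
    outbox.append((to, body))
    return "sent"


runner = ToolRunner({"send_email": send_email}, {"send_email"},
                    confirm=lambda name, args: False)
declined = runner.run("send_email", {"to": "a@example.com", "body": "hi"})
```

Returning a status dict instead of raising keeps the conversation loop simple: the model can be shown the "declined" or "error" result and continue.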

Agent Features

Memory

  • short-term conversation history

Planning

  • multi-step tool orchestration
  • tool selection planning

Tool Use

  • function calling
  • action vs non-action tool handling

Frameworks

  • OpenAI Chat Completions API (function calling)

Is Agentic

true

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small dataset: 78 conversations is limited coverage for broad real-world behaviors.
  • Scenarios were generated with GPT-4 then manually edited; some content is synthetic.
  • Tools are simulated; real APIs and error modes may differ.
  • Ground-truth sequences may not capture all valid alternative tool-call orders.

When Not To Use

  • Evaluating agents that require live web access or real network side effects.
  • Measuring open-ended retrieval performance or large-scale API coverage.

Failure Modes

  • Premature tool calls with hallucinated arguments.
  • Faulty planning and omission of required tools.
  • Correct tool chosen but wrong or misformatted arguments.
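Some of the wrong or misformatted arguments listed above can be caught before a tool runs. The sketch below checks predicted arguments against a JSON-schema-style parameter spec for missing, unexpected, and wrongly typed values. It is an illustrative assumption rather than anything from the paper, it covers only four primitive types, and `validate_args` and the example spec are hypothetical.

```python
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def validate_args(spec, args):
    """Return a list of problems; an empty list means these basic checks passed."""
    params = spec.get("properties", {})
    problems = []
    for name in spec.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in params:
            problems.append(f"unexpected argument: {name}")
            continue
        declared = params[name].get("type")
        expected = TYPE_MAP.get(declared)
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for numeric types
        is_bool_as_number = isinstance(value, bool) and declared in ("integer", "number")
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {declared}")
    return problems


# Hypothetical parameter spec, for illustration only.
spec = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "duration_minutes": {"type": "integer"},
    },
    "required": ["title"],
}

problems = validate_args(spec, {"duration_minutes": "30", "room": "A"})
```

A check like this catches format errors only. A hallucinated but well-typed value, such as an invented email address, still passes, which is why confirmation before action tools remains useful.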

Core Entities

Models

  • gpt-3.5-turbo-0613
  • gpt-4-0613

Metrics

  • success rate
  • precision
  • recall
  • incorrect action rate

Datasets

  • ToolTalk

Benchmarks

  • ToolTalk

Context Entities

Models

  • GPT-4 (used to generate scenarios)

Datasets

  • Prior tool benchmarks (ToolBench, API-Bank, AgentBench) as compared in paper