ToolQA — a benchmark that forces LLMs to use external tools, not memorized facts

June 23, 2023 · 7 min read

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.2

Citation Count

39

Authors

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang

Links

Abstract / PDF

Why It Matters For Business

If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.

Summary TLDR

ToolQA is a benchmark and data-curation pipeline that tests whether large language models actually use external tools to answer questions instead of relying on memorized knowledge. It covers 8 domains and 13 tool types, with 800 “easy” and 730 “hard” questions drawn from reference corpora outside the models' pretraining data. Without tools, ChatGPT averages 5.6% on easy questions and about 2% on hard ones; the best tool-augmented method reaches 43.15% on easy questions but only 8.2% on hard ones, exposing major gaps in tool planning, argument formation, and source selection.

Problem Statement

Existing evaluations cannot reliably tell whether an LLM answers from its internal memory or by querying external data and tools. A fair measure of tool-use ability needs reference corpora that lie outside pretraining data and questions that can only be answered through tool calls.

Main Contribution

ToolQA dataset: 8 domains, 13 tool types, 800 easy + 730 hard questions designed to require tool calls.

Automated three-phase curation: reference-data collection (out-of-pretraining), human-guided template generation, and programmatic answer generation.

Baseline evaluation and error analysis of standard LLMs and tool-augmented methods (ReAct, Chameleon), exposing common failure modes.

Public release of data and code under Apache-2.0 to foster further work.
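The template-plus-program idea behind the curation pipeline can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `FLIGHTS` rows, the `TEMPLATE` wording, and the `answer_program` and `generate` helpers are hypothetical and are not taken from the ToolQA code release. The point is only that the gold answer is computed from the reference data rather than written by hand.

```python
import random

# Hypothetical reference rows; the real corpora are far larger.
FLIGHTS = [
    {"flight": "AA100", "date": "2022-01-05", "delay_min": 12},
    {"flight": "AA100", "date": "2022-01-06", "delay_min": 0},
    {"flight": "UA200", "date": "2022-01-05", "delay_min": 47},
]

# A human-written question template with slots filled from the data.
TEMPLATE = "How many minutes was flight {flight} delayed on {date}?"


def answer_program(rows, flight, date):
    """Compute the gold answer directly from the reference data."""
    for row in rows:
        if row["flight"] == flight and row["date"] == date:
            return str(row["delay_min"])
    raise KeyError((flight, date))


def generate(rows, n, seed=0):
    """Sample n rows and turn each into a question/answer pair."""
    rng = random.Random(seed)
    picks = rng.sample(rows, n)
    return [
        {
            "question": TEMPLATE.format(flight=r["flight"], date=r["date"]),
            "answer": answer_program(rows, r["flight"], r["date"]),
        }
        for r in picks
    ]


qa_pairs = generate(FLIGHTS, 3)
```

Because the answer comes from a program over data the model has not seen, a correct response is evidence that the model queried the data rather than recalled it.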

Key Findings

Standard LLMs that do not use external tools fail on ToolQA.

Numbers: ChatGPT avg success: 5.6% (easy), ~2% (hard)

Tool-augmented methods improve but still perform poorly on hard tasks.

Numbers: Best tool-augmented: 43.15% (easy), 8.2% (hard)

Argument errors and wrong data-source selection dominate failures.

Numbers: Argument errors account for 44.56% (easy) and 48.23% (hard) of ReAct error cases

Results

Average success rate (easy questions)

Value: 43.15%

Baseline: ReAct (GPT-3)

Average success rate (hard questions)

Value: 8.2%

Baseline: ReAct (GPT-3.5)

Baseline success (no-tool LLMs)

Value: ≈5.6% (easy)

Baseline: ChatGPT (gpt-3.5-turbo)

Who Should Care

What To Try In 7 Days

Run a subset of ToolQA against your tool pipeline to spot argument and source-selection failures.

Add simple validation for tool arguments and observed outputs before accepting answers.

Provide tool-level demonstrations (one per tool) in prompts to reduce formatting and call errors.
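As a starting point for the argument-validation step, here is a minimal sketch of a pre-call check. The tool name `lookup_flight`, its argument schema, and the helper `validate_call` are hypothetical; a real pipeline would derive schemas from its own tool definitions.

```python
from datetime import datetime


def is_iso_date(value):
    """True if value parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical schema: tool name -> {argument name -> validator}.
TOOL_SCHEMAS = {
    "lookup_flight": {
        "flight": lambda v: isinstance(v, str) and v.isalnum(),
        "date": is_iso_date,
    },
}


def validate_call(tool, args):
    """Return a list of problems; an empty list means the call looks well-formed."""
    if tool not in TOOL_SCHEMAS:
        return [f"unknown tool: {tool}"]
    schema = TOOL_SCHEMAS[tool]
    problems = [f"missing argument: {k}" for k in schema if k not in args]
    problems += [f"unexpected argument: {k}" for k in args if k not in schema]
    problems += [
        f"invalid value for {k}: {args[k]!r}"
        for k, check in schema.items()
        if k in args and not check(args[k])
    ]
    return problems


ok = validate_call("lookup_flight", {"flight": "AA100", "date": "2022-01-05"})
bad = validate_call("lookup_flight", {"flight": "AA100", "date": "01/05/2022"})
```

Returning the problem list to the model as an observation, instead of executing a malformed call, gives it a chance to repair the arguments on the next step.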

Agent Features

Memory

  • short-term interaction history used in context; no long-term memory evaluated

Planning

  • tool composition planning
  • iterative plan refinement from execution feedback

Tool Use

  • text retrieval
  • database operations (FilterDB, GetValue)
  • SQL interpreter
  • Python interpreter
  • math calculator (WolframAlpha)
  • graph queries (Neighbour/Node/Edge checks)
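To make the database operators concrete, the sketch below mimics a filter-then-read pattern over an in-memory table. Only the operator names FilterDB and GetValue come from the benchmark; the function signatures, the supported comparison operators, and the `COFFEE` rows are assumptions made for this example and may differ from the released tools.

```python
import operator

# Comparison operators supported by this simplified filter (an assumption).
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def filter_db(rows, column, op, value):
    """Keep rows where `row[column] <op> value` holds."""
    return [r for r in rows if OPS[op](r[column], value)]


def get_value(rows, column):
    """Read one column from the current (possibly filtered) rows."""
    return [r[column] for r in rows]


# Hypothetical coffee-price rows for illustration only.
COFFEE = [
    {"date": "2021-03-01", "close": 131.5},
    {"date": "2021-03-02", "close": 134.0},
    {"date": "2021-03-03", "close": 129.8},
]

above_130 = filter_db(COFFEE, "close", ">", 130.0)
dates = get_value(above_130, "date")
```

Even in this toy form, the two failure sources highlighted in the findings are visible: the model must pick the right table and must supply a column name, operator, and value that actually exist.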

Frameworks

  • ReAct
  • Chameleon

Is Agentic

true

Architectures

  • prompt-controller with tool pool
  • in-context planner + tool-call loop
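The planner-plus-tool-call loop can be sketched in a few lines. This is a generic illustration in the spirit of ReAct-style agents, not the implementation evaluated in the paper: `run_agent`, `scripted_policy`, and `calculate` are hypothetical, and the scripted policy stands in for an LLM so the example runs without any API.

```python
def run_agent(question, policy, tools, max_steps=5):
    """Alternate between choosing an action and executing it until Finish."""
    history = []  # short-term interaction history kept in context
    for _ in range(max_steps):
        action, arg = policy(question, history)
        if action == "Finish":
            return arg, history
        if action not in tools:
            # Infeasible action: report it back instead of crashing.
            observation = f"Error: tool {action!r} is not in the pool"
        else:
            try:
                observation = tools[action](arg)
            except Exception as exc:
                observation = f"Error: {exc}"
        history.append((action, arg, observation))
    return None, history  # step budget exhausted without an answer


def calculate(expression):
    """Toy calculator tool: supports 'a + b' with integers only."""
    left, op, right = expression.split()
    if op != "+":
        raise ValueError(f"unsupported operator: {op}")
    return int(left) + int(right)


def scripted_policy(question, history):
    """Stand-in for the LLM: call the calculator once, then finish."""
    if not history:
        return "Calculate", "12 + 30"
    return "Finish", str(history[-1][2])


answer, steps = run_agent("What is 12 + 30?", scripted_policy, {"Calculate": calculate})
```

Feeding errors back as observations is what allows the plan to be refined from execution feedback rather than aborted at the first bad call.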

Collaboration

  • LLM acts as controller that composes external tools

Reproducibility

License

  • Apache-2.0

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Agenda corpus is synthetic; may not reflect real personal-data complexity
  • Tool set limited to 13 predefined operators and local corpora
  • Evaluations did not include larger closed models (e.g., GPT-4) due to lack of access

When Not To Use

  • When you only need to evaluate parametric (memorized) knowledge
  • When your application uses external services/APIs not covered by the 13 tool types
  • When you need human-level end-to-end dialogue evaluation rather than exact-match QA

Failure Modes

  • wrong tool arguments (common)
  • selecting incorrect reference corpus
  • hallucinated observations not backed by tool output
  • infeasible tool actions (tools not in pool)
  • context length overflow leading to missing history
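One of these failure modes, observations that are not backed by tool output, can be checked mechanically if tool outputs are logged separately from the model's context. The sketch below assumes such a log exists; the function name, the trace format, and the log format are all hypothetical.

```python
def find_unbacked_observations(trace, tool_log):
    """Return step ids whose in-context observation differs from the logged output.

    trace:    list of (step_id, observation_text) as it appears in the model context
    tool_log: dict mapping step_id to the output the tool actually returned
    """
    flagged = []
    for step_id, observed in trace:
        if step_id not in tool_log:
            flagged.append(step_id)  # no tool call was recorded for this step
        elif str(tool_log[step_id]).strip() != observed.strip():
            flagged.append(step_id)  # text in context does not match the tool output
    return flagged


# Hypothetical trace: step 2 was altered and step 3 never ran a tool.
trace = [(1, "12"), (2, "Delayed 47 minutes"), (3, "0")]
tool_log = {1: 12, 2: "Delayed 30 minutes"}
suspect_steps = find_unbacked_observations(trace, tool_log)
```

A check like this only covers exact textual mismatches; paraphrased or truncated observations would need a looser comparison.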

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo)
  • text-davinci-003 (GPT-3)
  • ReAct (GPT-3)
  • ReAct (GPT-3.5)
  • Chameleon

Metrics

  • Exact-match success rate
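A minimal sketch of an exact-match success rate follows. The normalization step (lowercasing and collapsing whitespace) is an assumption made for this example; the benchmark's own scoring script may normalize differently.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing (an assumed convention)."""
    return " ".join(str(text).strip().lower().split())


def success_rate(predictions, references):
    """Fraction of predictions that exactly match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


rate = success_rate(["42", " Paris ", "7"], ["42", "paris", "8"])
```

In this toy call, two of the three predictions match after normalization, so `rate` is 2/3.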

Datasets

  • ToolQA (this paper)
  • Flight status (2022-2023)
  • Daily Coffee Price (2000-2022)
  • Yelp (subset)
  • Airbnb (NY subset)
  • DBLP citation network
  • GSM8K (sampled error cases)
  • SciREX
  • Agenda (synthetic)

Benchmarks

  • ToolQA