ToolQA — a benchmark that forces LLMs to use external tools, not memorized facts

June 23, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and experiments are practical and reproducible, but current baselines show low performance on hard cases, so expect substantial engineering before production-grade tool agents.

Citations: 39

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 40%

Authors

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.

Summary TLDR

ToolQA is a benchmark and curation pipeline that tests whether large language models actually use external tools to answer questions instead of relying on memorized knowledge. It covers 8 domains and 13 tool types, with 800 "easy" and 730 "hard" questions drawn from reference corpora outside the models' pretraining data. Off-the-shelf LLMs without tools score ≈5% on easy questions and about 2% on hard ones. The best tool-augmented method scores 43.15% on easy questions and only 8.2% on hard questions, which exposes major gaps in tool planning, argument formation, and source selection.

Problem Statement

Existing evaluations cannot reliably tell when an LLM answers from its internal memory versus by querying external data and tools. We need a benchmark with reference corpora outside pretraining data and tool-based questions so we can fairly measure real tool-use abilities.

Main Contribution

ToolQA dataset: 8 domains, 13 tool types, 800 easy + 730 hard questions designed to require tool calls.

Automated three-phase curation: reference-data collection (out-of-pretraining), human-guided template generation, and programmatic answer generation.
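
The template-generation and answer-generation phases can be illustrated with a minimal Python sketch. The rows, column names, and template wording below are hypothetical and are not taken from the ToolQA release. The sketch only shows the general idea: a human-written template is filled from reference data, and the gold answer is computed by code, not written by hand.

```python
from string import Template

# Hypothetical reference rows; the real ToolQA corpora are far larger.
FLIGHTS = [
    {"flight": "AA100", "date": "2022-01-05", "origin": "JFK", "dep_delay_min": 12},
    {"flight": "DL200", "date": "2022-01-05", "origin": "ATL", "dep_delay_min": 0},
    {"flight": "UA300", "date": "2022-01-06", "origin": "JFK", "dep_delay_min": 45},
]

# A human-written template with slots that are filled from the data.
QUESTION_TEMPLATE = Template(
    "What was the departure delay in minutes of flight $flight on $date?"
)


def generate_questions(rows):
    """Fill the template from each row and compute the answer programmatically."""
    questions = []
    for row in rows:
        question = QUESTION_TEMPLATE.substitute(flight=row["flight"], date=row["date"])
        answer = str(row["dep_delay_min"])  # gold answer read straight from the data
        questions.append({"question": question, "answer": answer})
    return questions


qa_pairs = generate_questions(FLIGHTS)
```

Because the answer is derived from the same rows that fill the template, regenerating the questions after a data refresh keeps questions and answers in sync.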

Key Findings

Standard LLMs that do not use external tools fail on ToolQA.

Numbers: ChatGPT avg success: 5.6% (easy), ~2% (hard)

Practical Use: Do not expect out-of-the-box LLMs to reliably answer questions that require fresh external data; add tooling or retrieval.

Evidence Ref: Tables 3 and 4

Tool-augmented methods improve but still perform poorly on hard tasks.

Numbers: Best tool-augmented: 43.15% (easy), 8.2% (hard)

Practical Use: Tool integration helps for simple lookups, but current tool planners struggle with multi-step composition; expect low accuracy on complex queries.

Evidence Ref: Tables 3 and 4 (ReAct and Chameleon results)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average success rate (easy questions) | 43.15% | ReAct (GPT-3) | - | ToolQA easy (800 q) | ReAct (GPT-3) average in Table 3 | Table 3
Average success rate (hard questions) | 8.2% | ReAct (GPT-3.5) | - | ToolQA hard (730 q) | ReAct (GPT-3.5) average in Table 4 | Table 4

What To Try In 7 Days

Run a subset of ToolQA against your tool pipeline to spot argument and source-selection failures.

Add simple validation for tool arguments and observed outputs before accepting answers.

Provide tool-level demonstrations (one per tool) in prompts to reduce formatting and call errors.
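
The argument-validation step suggested above could look roughly like the following Python sketch. The tool names FilterDB and GetValue come from the tool list in this summary, but the argument schemas and the allowed-operator set are assumptions made for illustration, not the benchmark's actual interface.

```python
# Hypothetical argument schemas: tool name -> {argument name: expected type}.
TOOL_SCHEMAS = {
    "FilterDB": {"column": str, "operator": str, "value": str},
    "GetValue": {"column": str},
}
ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">="}  # assumed operator set


def validate_tool_call(tool, args):
    """Return a list of problems; an empty list means the call looks well-formed."""
    if tool not in TOOL_SCHEMAS:
        return [f"unknown tool: {tool}"]
    schema = TOOL_SCHEMAS[tool]
    problems = []
    # Every declared argument must be present and have the expected type.
    for name, expected_type in schema.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(f"argument {name} should be {expected_type.__name__}")
    # Arguments the schema does not declare are flagged as well.
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument: {name}")
    # Tool-specific check: only string operators from the allowed set pass.
    operator = args.get("operator")
    if tool == "FilterDB" and isinstance(operator, str) and operator not in ALLOWED_OPERATORS:
        problems.append("operator not allowed")
    return problems
```

Running this check before executing a call lets the agent feed the problem list back to the model as an observation instead of failing silently on a malformed call.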

Agent Features

Memory
short-term interaction history used in context; no long-term memory evaluated
Planning
tool composition planning; iterative plan refinement from execution feedback
Tool Use
text retrieval; database operations (FilterDB, GetValue); SQL interpreter; Python interpreter; math calculator (WolframAlpha); graph queries (Neighbour/Node/Edge checks)
Frameworks
ReAct; Chameleon
Is Agentic

Yes

Architectures
prompt-controller with tool pool; in-context planner + tool-call loop
Collaboration
LLM acts as controller that composes external tools
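
The controller-with-tool-pool pattern listed in this section can be sketched as a loop: a policy picks the next tool, the tool runs, and the observation is appended to the history the policy sees on the next step. In this Python sketch the policy is a scripted stand-in for an LLM, and the two toy tools, their names, and the data are hypothetical. It is not the ReAct or Chameleon implementation.

```python
def calculator(expression):
    """Toy calculator: supports only 'a + b' with integers."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


def lookup(key):
    """Toy retrieval over a tiny in-memory corpus (hypothetical data)."""
    corpus = {"coffee_price_2021": "3", "coffee_price_2022": "4"}
    return corpus.get(key, "not found")


TOOLS = {"Calculate": calculator, "Lookup": lookup}


def scripted_policy(question, history):
    """Stand-in for an LLM planner: chooses the next action from the history."""
    if len(history) == 0:
        return ("Lookup", "coffee_price_2021")
    if len(history) == 1:
        return ("Lookup", "coffee_price_2022")
    if len(history) == 2:
        first, second = history[0][2], history[1][2]
        return ("Calculate", f"{first} + {second}")
    return ("Finish", history[-1][2])


def run_controller(question, policy, max_steps=5):
    """Alternate between choosing an action and observing its result."""
    history = []  # list of (tool, argument, observation) tuples
    for _ in range(max_steps):
        tool, argument = policy(question, history)
        if tool == "Finish":
            return argument, history
        observation = TOOLS[tool](argument)
        history.append((tool, argument, observation))
    return None, history  # step budget exhausted without an answer


answer, trace = run_controller("Sum of the 2021 and 2022 coffee prices?", scripted_policy)
```

The step budget matters in practice: a planner that keeps issuing calls without converging returns no answer, which is easier to score and debug than an unbounded loop.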

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Apache-2.0

Risks & Boundaries

Limitations

Agenda corpus is synthetic; may not reflect real personal-data complexity

Tool set limited to 13 predefined operators and local corpora

When Not To Use

When you only need to evaluate parametric (memorized) knowledge

When your application uses external services/APIs not covered by the 13 tool types

Failure Modes

wrong tool arguments (common)

selecting incorrect reference corpus

Core Entities

Models

ChatGPT (gpt-3.5-turbo); text-davinci-003 (GPT-3); ReAct (GPT-3); ReAct (GPT-3.5); Chameleon

Metrics

Exact-match success rate
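
As a rough illustration of how an exact-match success rate can be computed, here is a minimal Python sketch. The normalization step (lowercasing and collapsing whitespace) is an assumption; the benchmark's own scoring script may normalize answers differently.

```python
def normalize(text):
    """Lowercase and collapse whitespace; an assumed normalization."""
    return " ".join(text.lower().split())


def exact_match_success_rate(predictions, references):
    """Percentage of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Two of three predictions match; "45 min" fails because of the extra unit text.
rate = exact_match_success_rate(["12", " JFK ", "45 min"], ["12", "jfk", "45"])
```

Exact match is strict by design, so an otherwise correct answer with extra words counts as a failure; this is worth keeping in mind when reading the low hard-split numbers.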

Datasets

ToolQA (this paper); Flight status (2022-2023); Daily Coffee Price (2000-2022); Yelp (subset); Airbnb (NY subset); DBLP citation network; GSM8K (sampled error cases); SciREX; Agenda (synthetic)

Benchmarks

ToolQA