ToolQA — a benchmark that forces LLMs to use external tools, not memorized facts

Overview

Decision Snapshot: Needs Validation

The dataset and experiments are practical and reproducible, but current baselines show low performance on hard cases, so expect substantial engineering before production-grade tool agents.

Citations: 39

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 40%

Authors

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang

Why It Matters For Business

If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

ToolQA is a benchmark and pipeline for testing whether large language models actually use external tools to answer questions rather than relying on memorized knowledge. It covers 8 domains and 13 tool types, with 800 “easy” and 730 “hard” questions drawn from reference corpora outside pretraining data. Off-the-shelf LLMs without tools score about 5% on easy questions and about 2% on hard ones; the best tool-augmented method scores 43.15% on easy and only 8.2% on hard questions, which points to major gaps in tool planning, argument formation, and source selection.

Problem Statement

Existing evaluations cannot reliably tell when an LLM answers from its internal memory versus by querying external data and tools. We need a benchmark with reference corpora outside pretraining data and tool-based questions so we can fairly measure real tool-use abilities.

Main Contribution

ToolQA dataset: 8 domains, 13 tool types, 800 easy + 730 hard questions designed to require tool calls.

Automated three-phase curation: reference-data collection (out-of-pretraining), human-guided template generation, and programmatic answer generation.
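
The three curation phases can be illustrated with a small sketch. Everything in it is hypothetical: the toy FLIGHTS rows, the template wording, and the make_questions helper are illustrative stand-ins, not the ToolQA code or data. It only shows the general idea of filling a human-written template from reference data and computing the answer by program.

```python
# Phase 1 stand-in: toy reference data. Rows and field names are invented.
FLIGHTS = [
    {"flight": "AA100", "date": "2022-01-05", "delay_min": 12},
    {"flight": "AA100", "date": "2022-01-06", "delay_min": 0},
    {"flight": "UA200", "date": "2022-01-05", "delay_min": 45},
]

# Phase 2 stand-in: a human-written template with placeholders.
TEMPLATE = "How many minutes was flight {flight} delayed on {date}?"


def make_questions(rows, template):
    """Phase 3 stand-in: fill the template and derive each answer from the data."""
    pairs = []
    for row in rows:
        question = template.format(flight=row["flight"], date=row["date"])
        answer = str(row["delay_min"])  # computed by code, not written by an LLM
        pairs.append((question, answer))
    return pairs


qa_pairs = make_questions(FLIGHTS, TEMPLATE)
```

Because each answer is read from the reference data by code, the question set can be regenerated whenever the corpus changes.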

Key Findings

Standard LLMs that do not use external tools fail on ToolQA.

Numbers: ChatGPT avg success: 5.6% (easy), ~2% (hard)

Practical Use: Do not expect out-of-the-box LLMs to reliably answer questions that require fresh external data; add tooling or retrieval.

Evidence Ref: Tables 3 and 4

Tool-augmented methods improve but still perform poorly on hard tasks.

Numbers: Best tool-augmented: 43.15% (easy), 8.2% (hard)

Practical Use: Tool integration helps for simple lookups, but current tool planners struggle with multi-step composition; expect low accuracy on complex queries.

Evidence Ref: Tables 3 and 4 (ReAct and Chameleon results)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average success rate (easy questions)	43.15%	ReAct (GPT-3)	—	ToolQA easy (800 q)	ReAct (GPT-3) average in Table 3	Table 3
Average success rate (hard questions)	8.2%	ReAct (GPT-3.5)	—	ToolQA hard (730 q)	ReAct (GPT-3.5) average in Table 4	Table 4

What To Try In 7 Days

Run a subset of ToolQA against your tool pipeline to spot argument and source-selection failures.

Add simple validation for tool arguments and observed outputs before accepting answers.

Provide tool-level demonstrations (one per tool) in prompts to reduce formatting and call errors.
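
The argument-validation step above could start as small as the sketch below. The tool names FilterDB and GetValue come from the benchmark's tool list, but the argument schemas, the TOOL_SCHEMAS table, and the validate_call helper are assumptions made for illustration; a real pipeline would use its own tool signatures.

```python
# Hypothetical schemas: tool name -> {argument name: expected Python type}.
TOOL_SCHEMAS = {
    "FilterDB": {"column": str, "value": str},
    "GetValue": {"column": str},
}


def validate_call(tool, args, schemas=TOOL_SCHEMAS):
    """Return a list of problems with a proposed tool call; empty means it passes."""
    if tool not in schemas:
        return [f"unknown tool: {tool}"]
    schema = schemas[tool]
    problems = []
    # Check that every required argument is present and has the expected type.
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            problems.append(f"argument {name} should be {expected.__name__}")
    # Flag arguments the tool does not define.
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument: {name}")
    return problems
```

Returning the list of problems, rather than raising, lets the caller feed the messages back to the model as an observation so it can retry the call.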

Agent Features

Memory

short-term interaction history used in context; no long-term memory evaluated

Planning

tool composition planning

iterative plan refinement from execution feedback

Tool Use

text retrieval

database operations (FilterDB, GetValue)

SQL interpreter

Python interpreter

math calculator (WolframAlpha)

graph queries (Neighbour/Node/Edge checks)

Frameworks

ReAct

Chameleon

Is Agentic

Yes

Architectures

prompt-controller with tool pool

in-context planner + tool-call loop

Collaboration

LLM acts as controller that composes external tools
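
A controller that composes tools through a tool-call loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the ReAct or Chameleon implementation: the LLM is replaced by a hypothetical scripted_planner function, the tool pool holds a single toy add_tool, and the "Finish" action name is chosen here for the example.

```python
def run_agent(question, planner, tools, max_steps=5):
    """Ask the planner for an action, run the tool, and feed back the observation."""
    history = []  # short-term interaction history kept in context
    for _ in range(max_steps):
        action, argument = planner(question, history)
        if action == "Finish":
            return argument, history
        if action not in tools:
            observation = f"Error: unknown tool {action}"
        else:
            try:
                observation = tools[action](argument)
            except Exception as exc:  # surface tool errors so the planner can react
                observation = f"Error: {exc}"
        history.append((action, argument, observation))
    return None, history  # step budget exhausted without an answer


def add_tool(expression):
    """Toy tool: evaluates expressions of the form 'a+b' with integer operands."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


def scripted_planner(question, history):
    """Stand-in for the LLM controller: one tool call, then finish with its result."""
    if not history:
        return "Calculate", "2+3"
    return "Finish", history[-1][2]


answer, trace = run_agent("What is 2 + 3?", scripted_planner, {"Calculate": add_tool})
```

Keeping errors in the history as observations, instead of aborting, is what allows a planner to refine its plan from execution feedback.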

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Apache-2.0

Code URLs

https://github.com/night-chen/ToolQA

Data URLs

https://github.com/night-chen/ToolQA

Risks & Boundaries

Limitations

Agenda corpus is synthetic; may not reflect real personal-data complexity

Tool set limited to 13 predefined operators and local corpora

When Not To Use

When you only need to evaluate parametric (memorized) knowledge

When your application uses external services/APIs not covered by the 13 tool types

Failure Modes

wrong tool arguments (common)

selecting incorrect reference corpus

Core Entities

Models

ChatGPT (gpt-3.5-turbo)

text-davinci-003 (GPT-3)

ReAct (GPT-3)

ReAct (GPT-3.5)

Chameleon

Metrics

Exact-match success rate
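
An exact-match success rate can be computed along the lines below. The normalization (lowercasing and collapsing whitespace) and the function names are assumptions for illustration; the benchmark's own scoring script may normalize answers differently.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(text).lower().split())


def success_rate(predictions, references):
    """Fraction of predictions that exactly match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Two of the three toy predictions match their references.
rate = success_rate(["12", " New York ", "7"], ["12", "new york", "8"])
```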

Datasets

ToolQA (this paper)

Flight status (2022-2023)

Daily Coffee Price (2000-2022)

Yelp (subset)

Airbnb (NY subset)

DBLP citation network

GSM8K (sampled error cases)

SciREX

Agenda (synthetic)

Benchmarks

ToolQA
