Overview
Production Readiness
0.3
Novelty Score
0.4
Cost Impact Score
0.2
Citation Count
39
Why It Matters For Business
If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.
Summary TLDR
ToolQA is a benchmark and pipeline designed to test whether large language models actually use external tools to answer questions rather than rely on memorized knowledge. It covers 8 domains and 13 tool types, with 800 “easy” and 730 “hard” questions drawn from out-of-pretraining reference corpora. Off-the-shelf LLMs without tools score ≈5% on easy questions; the best tool-augmented method scores 43.1% on easy and only 8.2% on hard questions, highlighting major gaps in tool planning, argument formation, and source selection.
Problem Statement
Existing evaluations cannot reliably distinguish whether an LLM answers from its internal memory or by querying external data and tools. Measuring real tool-use ability fairly requires a benchmark whose reference corpora lie outside pretraining data and whose questions are designed to require tool calls.
Main Contribution
ToolQA dataset: 8 domains, 13 tool types, 800 easy + 730 hard questions designed to require tool calls.
Automated three-phase curation: reference-data collection (out-of-pretraining), human-guided template generation, and programmatic answer generation.
Baseline evaluation and error analysis of standard LLMs and tool-augmented methods (ReAct, Chameleon), exposing common failure modes.
Public release of data and code under Apache-2.0 to foster further work.
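The template and programmatic-answer phases of the curation pipeline can be illustrated with a small sketch. The rows, column names, and template below are hypothetical stand-ins chosen for illustration; they are not ToolQA's actual data or code.

```python
# Hypothetical reference rows standing in for an out-of-pretraining corpus.
FLIGHTS = [
    {"flight": "AA100", "date": "2022-03-01", "delay_min": 12},
    {"flight": "AA100", "date": "2022-03-02", "delay_min": 0},
    {"flight": "UA200", "date": "2022-03-01", "delay_min": 45},
]

# A human-written question template whose slots are filled from the data.
TEMPLATE = "What was the departure delay in minutes of flight {flight} on {date}?"


def generate_qa(rows, template):
    """Instantiate the template for each row and compute the answer from the data."""
    pairs = []
    for row in rows:
        question = template.format(flight=row["flight"], date=row["date"])
        answer = str(row["delay_min"])  # answer is read from the corpus, not from an LLM
        pairs.append({"question": question, "answer": answer})
    return pairs


qa_pairs = generate_qa(FLIGHTS, TEMPLATE)
```

Because each answer is computed directly from the reference rows, it stays correct by construction and cannot be recovered from memorized knowledge alone.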
Key Findings
Standard LLMs that do not use external tools score only ≈5% on easy ToolQA questions.
Tool-augmented methods improve on this but still perform poorly on hard questions: the best method reaches 43.1% on easy and only 8.2% on hard.
Argument errors and wrong data-source selection dominate failures.
Results
Average success rate (easy questions)
43.1% (best tool-augmented method)
Average success rate (hard questions)
8.2% (best tool-augmented method)
Baseline success (no-tool LLMs)
≈5% (easy questions)
Who Should Care
What To Try In 7 Days
Run a subset of ToolQA against your tool pipeline to spot argument and source-selection failures.
Add simple validation for tool arguments and observed outputs before accepting answers.
Provide tool-level demonstrations (one per tool) in prompts to reduce formatting and call errors.
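As one way to approach the argument-validation step, the sketch below checks a proposed tool call against a schema before it is executed. The tool names come from the tool list in this summary, but the argument names, types, and allowed operators are assumptions made for illustration.

```python
# Hypothetical schemas: required argument names and their expected types.
TOOL_SCHEMAS = {
    "FilterDB": {"column": str, "operator": str, "value": str},
    "GetValue": {"column": str},
}
ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">="}  # assumed set


def validate_call(tool, args):
    """Return a list of problems; an empty list means the call looks well-formed."""
    if tool not in TOOL_SCHEMAS:
        return [f"unknown tool: {tool}"]
    schema = TOOL_SCHEMAS[tool]
    problems = []
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            problems.append(f"argument {name} should be {expected.__name__}")
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument: {name}")
    op = args.get("operator")
    if tool == "FilterDB" and isinstance(op, str) and op not in ALLOWED_OPERATORS:
        problems.append(f"unsupported operator: {op}")
    return problems


ok = validate_call("FilterDB", {"column": "delay", "operator": ">", "value": "30"})
bad = validate_call("GetValue", {})
```

Returning the list of problems to the model as an observation, rather than raising an exception, gives it a chance to repair the call on the next step.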
Agent Features
Memory
- short-term interaction history used in context; no long-term memory evaluated
Planning
- tool composition planning
- iterative plan refinement from execution feedback
Tool Use
- text retrieval
- database operations (FilterDB, GetValue)
- SQL interpreter
- Python interpreter
- math calculator (WolframAlpha)
- graph queries (Neighbour/Node/Edge checks)
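To make the database operations concrete, here is a minimal sketch of what FilterDB- and GetValue-style operators might look like over an in-memory table. The function signatures, the supported operators, and the sample rows are assumptions for illustration; ToolQA's actual tool implementations may differ.

```python
import operator

# Comparison operators a filter condition may use (an assumed subset).
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def filter_db(rows, column, op, value):
    """Keep only the rows whose column satisfies the condition."""
    compare = OPS[op]
    return [row for row in rows if compare(row[column], value)]


def get_value(rows, column):
    """Return the values of one column from the current rows."""
    return [row[column] for row in rows]


# Hypothetical sample rows, not taken from any ToolQA corpus.
listings = [
    {"name": "Loft", "price": 120},
    {"name": "Studio", "price": 80},
    {"name": "Suite", "price": 200},
]

cheap = filter_db(listings, "price", "<", 150)
names = get_value(cheap, "name")
```

Answering a question then becomes a composition of such calls, which is where wrong arguments or a wrong source table can derail the result.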
Frameworks
- ReAct
- Chameleon
Is Agentic
true
Architectures
- prompt-controller with tool pool
- in-context planner + tool-call loop
Collaboration
- LLM acts as controller that composes external tools
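The controller-plus-tool-pool pattern can be sketched as a short loop. Everything here is a simplified assumption: the two tools are toy functions, and `scripted_planner` is a hard-coded stand-in for the LLM so the example runs without any model or API.

```python
def lookup_price(item):
    """Hypothetical retrieval tool over a tiny local table."""
    prices = {"espresso": 3, "latte": 4}
    return str(prices[item])


def calculate(expression):
    """Hypothetical calculator tool: supports only 'a * b' with integers."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


TOOL_POOL = {"Lookup": lookup_price, "Calculate": calculate}


def scripted_planner(question, history):
    """Stand-in for the LLM: chooses the next action from the history so far."""
    if not history:
        return ("Lookup", "latte")
    if len(history) == 1:
        price = history[0][2]  # observation from the first step
        return ("Calculate", f"{price} * 5")
    return ("Finish", history[-1][2])


def run_controller(question, planner, tools, max_steps=5):
    """Alternate planning and tool execution until the planner finishes."""
    history = []  # (tool, argument, observation) triples kept in context
    for _ in range(max_steps):
        tool, argument = planner(question, history)
        if tool == "Finish":
            return argument, history
        if tool not in tools:
            observation = f"Error: tool {tool} is not in the pool"
        else:
            observation = tools[tool](argument)
        history.append((tool, argument, observation))
    return None, history


answer, trace = run_controller("What do 5 lattes cost?", scripted_planner, TOOL_POOL)
```

Feeding each observation back into the history is what lets the planner refine its next step from execution feedback; the `max_steps` cap is an assumed safeguard against loops that never finish.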
Reproducibility
License
- Apache-2.0
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Agenda corpus is synthetic; may not reflect real personal-data complexity
- Tool set limited to 13 predefined operators and local corpora
- Evaluations did not include larger closed models (e.g., GPT-4) due to lack of access
When Not To Use
- When you only need to evaluate parametric (memorized) knowledge
- When your application uses external services/APIs not covered by the 13 tool types
- When you need human-level end-to-end dialogue evaluation rather than exact-match QA
Failure Modes
- wrong tool arguments (common)
- selecting incorrect reference corpus
- hallucinated observations not backed by tool output
- infeasible tool actions (tools not in pool)
- context length overflow leading to missing history
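When reviewing your own agent traces, a simple tagger can surface some of these failure modes automatically. The sketch below assumes a hypothetical log format in which each step records the tool name, the real tool output (or `None` if the call was rejected), and the observation the model claimed; it is not the error-analysis procedure used in the paper.

```python
from collections import Counter

KNOWN_TOOLS = {"FilterDB", "GetValue"}  # assumed tool pool for this example


def tag_step(step):
    """Assign one coarse failure label to a logged step, or 'ok'."""
    if step["tool"] not in KNOWN_TOOLS:
        return "infeasible_action"
    if step["tool_output"] is None:
        return "argument_error"  # the tool rejected the call
    if step["claimed_observation"] != step["tool_output"]:
        return "hallucinated_observation"
    return "ok"


# Hypothetical logged steps, invented for illustration.
steps = [
    {"tool": "FilterDB", "tool_output": "3 rows", "claimed_observation": "3 rows"},
    {"tool": "WebSearch", "tool_output": None, "claimed_observation": "found it"},
    {"tool": "GetValue", "tool_output": None, "claimed_observation": ""},
    {"tool": "GetValue", "tool_output": "42", "claimed_observation": "45"},
]

counts = Counter(tag_step(step) for step in steps)
```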
Core Entities
Models
- ChatGPT (gpt-3.5-turbo)
- text-davinci-003 (GPT-3)
- ReAct (GPT-3)
- ReAct (GPT-3.5)
- Chameleon
Metrics
- Exact-match success rate
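An exact-match success rate can be computed along the following lines. The normalization step (lowercasing and collapsing whitespace) is an assumption for illustration; the benchmark's own scoring script may normalize answers differently.

```python
def normalize(text):
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return " ".join(str(text).lower().split())


def exact_match_rate(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Two of the three predictions match their references after normalization.
rate = exact_match_rate(["45", " New York ", "7"], ["45", "new york", "8"])
```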
Datasets
- ToolQA (this paper)
- Flight status (2022-2023)
- Daily Coffee Price (2000-2022)
- Yelp (subset)
- Airbnb (NY subset)
- DBLP citation network
- GSM8K (sampled error cases)
- SciREX
- Agenda (synthetic)
Benchmarks
- ToolQA

