Overview
Well-executed benchmark with 700 curated samples and 14 models tested; findings are robust for diagnosing tool-use hallucinations but are limited to the tested models and tool formats.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 2/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
License: MIT
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 55%
Why It Matters For Business
Tool-augmented LLMs can call wrong or non-existent tools and often overestimate solvability; that risks incorrect automation, unsafe commands, and wasted API calls in production.
Who Should Care
Summary TLDR
ToolBH is a diagnostic benchmark that tests how LLMs hallucinate when asked to use external tools. It defines three evaluation depths: solvability detection (L1), solution planning (L2), and missing-tool analysis (L3). It also defines three toolset scenarios that induce hallucination: missing necessary tools, potential (hidden) tools, and limited-function tools. The authors build 700 curated samples ((50 solvable + 50 unsolvable) × 7 subtasks), run 14 models (7 proprietary, 7 open-weight), and show that even state-of-the-art models struggle: the best overall score was 45.3% (Gemini-1.5-Pro), and open-weight models lag especially on unsolvable cases. Main failure modes are solvability hallucination, predicting non‑exi
Problem Statement
When LLMs are asked to use external tools, they can hallucinate—calling tools that don't exist, misjudging whether a task is solvable with the given tools, or planning incorrect tool sequences. Existing tool benchmarks assume a complete tool list and miss these real-world failure modes. We need a diagnostic benchmark that exposes why LLMs hallucinate in tool-augmented settings.
Main Contribution
ToolBH: a multi-level diagnostic benchmark (L1 solvability, L2 planning, L3 missing-tool analysis) for tool-augmented LLMs.
A breadth taxonomy of tool scenarios that induce hallucination: Missing Necessary Tools (MNT), Potential Tools (PT), and Limited Functionality Tools (LFT).
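The two axes above (depth levels and toolset scenarios) can be pictured as a small grid. The following is a minimal sketch, not the authors' data format: the class and constant names are hypothetical, and pairing every scenario with every level is an assumption made for illustration. Only the level names, scenario names, and sample counts come from this summary.

```python
from enum import Enum
from itertools import product


class Level(Enum):
    """Evaluation depths named in the summary."""
    L1 = "solvability detection"
    L2 = "solution planning"
    L3 = "missing-tool analysis"


class Scenario(Enum):
    """Toolset scenarios that induce hallucination."""
    MNT = "Missing Necessary Tools"
    PT = "Potential Tools"
    LFT = "Limited Functionality Tools"


# Assumption: each scenario is assessed at each depth level.
EVALUATION_GRID = list(product(Scenario, Level))

# Sample counts as reported: 50 solvable and 50 unsolvable per subtask, 7 subtasks.
SUBTASKS = 7
SOLVABLE_PER_SUBTASK = 50
UNSOLVABLE_PER_SUBTASK = 50
TOTAL_SAMPLES = SUBTASKS * (SOLVABLE_PER_SUBTASK + UNSOLVABLE_PER_SUBTASK)
```

Under these assumptions the grid has nine cells and the sample arithmetic reproduces the 700 curated samples. How the seven subtasks map onto the three scenarios is not stated here, so the sketch leaves that mapping out.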
Key Findings
Top proprietary models still perform poorly on tool-hallucination tasks.
Open-weight models underperform on unsolvable tasks relative to proprietary models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall score (ToolBH, higher is better) | Gemini-1.5-Pro: 45.3%, GPT-4o: 37.0%, Llama-3-70B: 14.6% | — | — | ToolBH (all scenarios) | Table 2 reports overall percentage scores across 14 models | Table 2 |
| L1 accuracy (L1-EM column, higher is better) | GPT-4-0613: 59.3%, Gemini-1.5-Pro: 62.7%, Llama-3-70B: 31.3% | — | — | ToolBH (L1) | L1-EM column in Table 2 | Table 2 |
What To Try In 7 Days
Run ToolBH or a subset on your internal toolset to find solvability-hallucination cases.
Add a solvability classifier or guardrail before any tool call to block impossible actions.
Limit verbose planning in open-weight models (shorter outputs) and require structured tool plans (<tool sequence> tags).
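One way to combine the guardrail and structured-plan suggestions is to parse the plan and reject any tool name that is not in the provided list before executing anything. This is a minimal sketch under stated assumptions: the `<tool_sequence>` tag spelling, the comma-separated plan format, and the function names `parse_plan` and `check_plan` are all hypothetical and are not taken from ToolBH.

```python
import re

# Hypothetical plan format: <tool_sequence>tool_a, tool_b</tool_sequence>
PLAN_PATTERN = re.compile(r"<tool_sequence>(.*?)</tool_sequence>", re.DOTALL)


def parse_plan(model_output):
    """Return the planned tool names, or None if no structured plan is present."""
    match = PLAN_PATTERN.search(model_output)
    if match is None:
        return None
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def check_plan(model_output, available_tools):
    """Return (ok, problems); ok is False if the plan is missing or names unknown tools."""
    plan = parse_plan(model_output)
    if plan is None:
        return False, ["no structured plan found"]
    problems = [f"unknown tool: {name}" for name in plan if name not in available_tools]
    return len(problems) == 0, problems
```

A call such as `check_plan(output, {"search", "calculator"})` would be placed in front of the tool executor, so a plan that names a tool outside the set is blocked rather than run. This only catches non-existent tool names; deciding whether a task is solvable at all would need a separate classifier.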
Agent Features
Planning
Tool Use
Frameworks
Architectures
Optimization Features
Infra Optimization
Reproducibility
Risks & Boundaries
Limitations
Limited model coverage: seven open-weight and seven proprietary models only.
Tool descriptions use only tool names (no API parameter tests), so results may not fully reflect API-driven tool use.
When Not To Use
When you need to evaluate API-level tool correctness, including parameter passing (ToolBH describes tools by name only).
As the sole benchmark for cross-lingual or domain-specific tool suites not represented in ToolBH.
Failure Modes
Solvability hallucination: model claims a task is solvable when it is not.
Non-existent tool prediction: calling tools not in the provided list.
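The two failure modes above can be counted on your own logs with a few lines of code. The sketch below is illustrative only: the `Record` fields and both rate functions are hypothetical names, and these simple rates are not ToolBH's official metrics.

```python
from dataclasses import dataclass


@dataclass
class Record:
    solvable: bool            # ground truth: can the task be solved with the given tools?
    predicted_solvable: bool  # the model's own solvability judgement
    called_tools: list        # tool names the model tried to use
    available_tools: set      # tool names actually provided


def solvability_hallucination_rate(records):
    """Share of unsolvable tasks that the model claimed were solvable."""
    unsolvable = [r for r in records if not r.solvable]
    if not unsolvable:
        return 0.0
    return sum(r.predicted_solvable for r in unsolvable) / len(unsolvable)


def nonexistent_tool_rate(records):
    """Share of records in which at least one called tool was not provided."""
    if not records:
        return 0.0
    flagged = sum(
        any(tool not in r.available_tools for tool in r.called_tools)
        for r in records
    )
    return flagged / len(records)
```

Tracking the two rates separately helps show whether a model mostly misjudges solvability or mostly invents tools, which call for different mitigations.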

