Overview
Production Readiness
0.3
Novelty Score
0.55
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
Tool-augmented LLMs can call wrong or non-existent tools and often overestimate solvability; that risks incorrect automation, unsafe commands, and wasted API calls in production.
Summary TLDR
ToolBH is a diagnostic benchmark that tests how LLMs hallucinate when asked to use external tools. It defines three evaluation depths: solvability detection (L1), solution planning (L2), and missing-tool analysis (L3). It also defines three toolset scenarios that induce hallucination: missing necessary tools, potential (hidden) tools, and limited-function tools. The authors build 700 curated samples (7 subtasks, each with 50 solvable and 50 unsolvable samples), run 14 models (7 proprietary, 7 open-weight), and show that even state-of-the-art models struggle: the best overall score was 45.3% (Gemini-1.5-Pro), and open-weight models lag especially on unsolvable cases. Main failure modes are solvability hallucination, predicting non‑exi
Problem Statement
When LLMs are asked to use external tools, they can hallucinate—calling tools that don't exist, misjudging whether a task is solvable with the given tools, or planning incorrect tool sequences. Existing tool benchmarks assume a complete tool list and miss these real-world failure modes. We need a diagnostic benchmark that exposes why LLMs hallucinate in tool-augmented settings.
Main Contribution
ToolBH: a multi-level diagnostic benchmark (L1 solvability, L2 planning, L3 missing-tool analysis) for tool-augmented LLMs.
A breadth taxonomy of tool scenarios that induce hallucination: Missing Necessary Tools (MNT), Potential Tools (PT), and Limited Functionality Tools (LFT).
A curated dataset of 700 samples (7 subtasks × 100 samples) with human-in-the-loop generation and filtering; evaluation metrics and code released under MIT.
Large-scale evaluation of 14 LLMs (7 proprietary, 7 open-weight) with detailed error analysis and practical failure modes.
Key Findings
Top proprietary models still perform poorly on tool-hallucination tasks.
Open-weight models underperform on unsolvable tasks relative to proprietary models.
The dominant error type is 'solvability hallucination'—models assert solvability when tools are insufficient.
Response length impacts models differently: open-weight models degrade with verbosity; proprietary models often benefit from longer reasoning.
Results
Overall score (ToolBH, higher is better)
Accuracy
Level-3 matching score (L3-MS)
Open-weight vs proprietary on unsolvable tasks
Who Should Care
What To Try In 7 Days
Run ToolBH or a subset on your internal toolset to find solvability-hallucination cases.
Add a solvability classifier or guardrail before any tool call to block impossible actions.
Limit verbose planning in open-weight models (shorter outputs) and require structured tool plans (<tool sequence> tags).
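A pre-execution guardrail of the kind suggested above could look roughly like the following. This is a minimal sketch, not the ToolBH authors' implementation: it assumes the model's plan has already been parsed into a list of tool names, and it treats 'UnsolvableQuery' as a sentinel the model emits when it judges a step impossible. The function name `check_plan`, the `GuardResult` type, and the example tool names are hypothetical.

```python
from dataclasses import dataclass

# Sentinel name taken from the summary; its exact semantics here are an assumption.
UNSOLVABLE = "UnsolvableQuery"


@dataclass
class GuardResult:
    execute: bool              # True only if the plan is safe to run
    unknown_tools: list[str]   # tool names that are not in the provided toolset
    declared_unsolvable: bool  # the model itself flagged an unsolvable step


def check_plan(plan: list[str], available: set[str]) -> GuardResult:
    """Block a plan that names unknown tools or declares itself unsolvable."""
    unknown = [t for t in plan if t not in available and t != UNSOLVABLE]
    declared = UNSOLVABLE in plan
    return GuardResult(
        execute=not unknown and not declared,
        unknown_tools=unknown,
        declared_unsolvable=declared,
    )


if __name__ == "__main__":
    tools = {"web_search", "calculator"}  # hypothetical toolset
    print(check_plan(["web_search", "send_fax"], tools))
```

A check like this only catches non-existent tools and explicit unsolvable flags. It does not detect a plan that uses only valid tools for a task those tools cannot actually solve, which is the solvability-hallucination case and would need a separate classifier.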
Agent Features
Planning
- solution planning (decompose to subtasks)
- tool sequencing (Progress Rate metric)
Tool Use
- API/tool selection
- explicit 'UnsolvableQuery' guard
Frameworks
- ReAct
Architectures
- tool-augmented LLMs
- MoE
Optimization Features
Infra Optimization
- vLLM used for open-weight inference
Reproducibility
License
- MIT
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Limited model coverage: seven open-weight and seven proprietary models only.
- Tool descriptions use only tool names (no API parameter tests), so results may not fully reflect API-driven tool use.
- Annotation team limited to one region; potential cultural or regional bias in sample design.
When Not To Use
- To evaluate API-level tool correctness including parameter passing (ToolBH uses name-only tool descriptions).
- As the sole benchmark for cross-lingual or domain-specific tool suites not represented in ToolBH.
Failure Modes
- Solvability hallucination: model claims a task is solvable when it is not.
- Non-existent tool prediction: calling tools not in the provided list.
- Wrong tool reasoning: selecting tools in the wrong order or for wrong subgoals.
- Long-text forgetting: verbose outputs that lose track of unsolvable subgoals.
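For offline error analysis, the first two failure modes above can be tagged mechanically once predictions and gold labels are available. The sketch below assumes a hypothetical record layout of (predicted solvable, gold solvable, predicted tools, available tools); the benchmark's own annotation procedure is not described in this summary and may differ. Wrong tool reasoning and long-text forgetting need reference plans or manual review and are not covered here.

```python
from collections import Counter


def tag_failures(pred_solvable, gold_solvable, pred_tools, available):
    """Return failure-mode tags for one model response (hypothetical record layout)."""
    tags = []
    # Model says "solvable" although the gold label says the tools are insufficient.
    if pred_solvable and not gold_solvable:
        tags.append("solvability_hallucination")
    # Model names at least one tool that was never offered.
    if any(tool not in available for tool in pred_tools):
        tags.append("non_existent_tool")
    return tags


def tally(records):
    """Count failure tags over an iterable of 4-tuples as described above."""
    counts = Counter()
    for rec in records:
        counts.update(tag_failures(*rec))
    return counts


if __name__ == "__main__":
    demo = [
        (True, False, ["search", "fax"], {"search"}),  # both failure modes
        (False, False, [], {"search"}),                # correct refusal
    ]
    print(tally(demo))
```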
Core Entities
Models
- Gemini-1.5-Pro
- GPT-4o
- GPT-4-Turbo
- GPT-4-0613
- GPT-4-1106
- GPT-3.5-Turbo
- Gemini-1.0-Pro
- Llama-3-70B
- Llama-3-8B
- Llama-2-70B
- Llama-2-13B
- Llama-2-7B
- Mistral-7B
- Mixtral-8x7B
Metrics
- L1-EM (Exact Match solvability)
- L2-PR (Progress Rate for tool sequence)
- L3-PR (Progress Rate)
- L3-MS (Matching Score via embedding similarity)
- Overall score (percentage)
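The summary names these metrics without giving formulas, so the following is only an illustrative sketch under stated assumptions: exact match compares normalized solvability labels, Progress Rate is read as the fraction of the reference tool sequence matched before the first mismatch, and the matching score is read as cosine similarity between two embedding vectors. The benchmark's released code may define them differently, and the embedding model is left out entirely.

```python
import math


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized labels agree, else 0.0 (assumed L1-EM reading)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def progress_rate(pred: list[str], gold: list[str]) -> float:
    """Share of the reference sequence matched before the first mismatch (assumed)."""
    if not gold:
        return 0.0
    matched = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        matched += 1
    return matched / len(gold)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Two of four reference steps are matched before the plan diverges.
    print(progress_rate(["a", "b", "x"], ["a", "b", "c", "d"]))
```

Under the prefix reading, a plan that gets a later step right after an early mistake earns no credit for it, which makes the score sensitive to ordering errors.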
Datasets
- ToolBH (ToolBeHonest) benchmark
Benchmarks
- AgentBench
- ToolBench
- AgentBoard
- MetaTool
- StableToolBench

