ToolBH: a multi-level benchmark that finds tool-use hallucinations in LLMs

June 28, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

Well-executed benchmark with 700 curated samples and 14 models tested; findings are robust for diagnosing tool-use hallucinations but are limited to the tested models and tool formats.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 55%

Authors

Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, Hayato Yamana

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Tool-augmented LLMs can call wrong or non-existent tools and often overestimate solvability; that risks incorrect automation, unsafe commands, and wasted API calls in production.

Who Should Care

Summary TLDR

ToolBH is a diagnostic benchmark that tests how LLMs hallucinate when asked to use external tools. It defines three evaluation depths: solvability detection (L1), solution planning (L2), and missing-tool analysis (L3). It also defines three toolset scenarios that induce hallucination: missing necessary tools, potential (hidden) tools, and limited-function tools. The authors build 700 curated samples ((50 solvable + 50 unsolvable) × 7 subtasks), run 14 models (7 proprietary, 7 open-weight), and show that even state-of-the-art models struggle: the best overall score was 45.3% (Gemini-1.5-Pro), and open-weight models lag, especially on unsolvable cases. Main failure modes are solvability hallucination, predicting non‑exi

Problem Statement

When LLMs are asked to use external tools, they can hallucinate—calling tools that don't exist, misjudging whether a task is solvable with the given tools, or planning incorrect tool sequences. Existing tool benchmarks assume a complete tool list and miss these real-world failure modes. We need a diagnostic benchmark that exposes why LLMs hallucinate in tool-augmented settings.

Main Contribution

ToolBH: a multi-level diagnostic benchmark (L1 solvability, L2 planning, L3 missing-tool analysis) for tool-augmented LLMs.

A breadth taxonomy of tool scenarios that induce hallucination: Missing Necessary Tools (MNT), Potential Tools (PT), and Limited Functionality Tools (LFT).
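To make the Missing Necessary Tools scenario concrete, here is a minimal sketch of how an unsolvable variant could be derived from a solvable sample by hiding one tool the reference plan depends on. The `Sample` fields, the `make_missing_tool_variant` helper, and the example tool names are all hypothetical; the summary does not describe the authors' actual construction pipeline or data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample layout; not the benchmark's real schema."""
    query: str
    toolset: tuple[str, ...]    # tool names shown to the model
    gold_plan: tuple[str, ...]  # reference tool sequence; empty if unsolvable
    solvable: bool


def make_missing_tool_variant(sample: Sample, removed: str) -> Sample:
    """Derive an unsolvable sample by hiding one tool the gold plan needs."""
    if removed not in sample.gold_plan:
        raise ValueError(f"{removed!r} is not required by the gold plan")
    return Sample(
        query=sample.query,
        toolset=tuple(t for t in sample.toolset if t != removed),
        gold_plan=(),
        solvable=False,
    )


# Illustrative data only.
base = Sample(
    query="Translate the attached PDF into French",
    toolset=("pdf_reader", "translator", "calculator"),
    gold_plan=("pdf_reader", "translator"),
    solvable=True,
)
variant = make_missing_tool_variant(base, "translator")

# Sample count stated in the summary: (50 solvable + 50 unsolvable) per subtask, 7 subtasks.
TOTAL_SAMPLES = (50 + 50) * 7
```

A model that still answers "solvable" for `variant`, or plans with `translator` anyway, would exhibit the hallucinations the benchmark is designed to surface.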

Key Findings

Top proprietary models still perform poorly on tool-hallucination tasks.

Numbers: Gemini-1.5-Pro overall score = 45.3%, GPT-4o = 37.0% (Table 2)

Practical Use: Do not assume a high-quality LLM will reliably choose or describe tools; validate tool plans with a diagnostic step before execution.

Evidence Ref: Table 2

Open-weight models underperform on unsolvable tasks relative to proprietary models.

Numbers: On unsolvable tasks, open-weight models reach 39.4% of proprietary models' performance (Sec. 5.3, Table 3)

Practical Use: If you need robust behavior under incomplete toolsets, prefer validated proprietary models or retrain open models with targeted unsolvability examples.

Evidence Ref: Sec. 5.3 / Table 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Overall score (ToolBH, higher is better) | Gemini-1.5-Pro: 45.3%, GPT-4o: 37.0%, Llama-3-70B: 14.6% | | | ToolBH (all scenarios) | Table 2 reports overall percentage scores across 14 models | Table 2 |
| Accuracy | GPT-4-0613: 59.3%, Gemini-1.5-Pro: 62.7%, Llama-3-70B: 31.3% | | | ToolBH (L1) | L1-EM column in Table 2 | Table 2 |

What To Try In 7 Days

Run ToolBH or a subset on your internal toolset to find solvability-hallucination cases.

Add a solvability classifier or guardrail before any tool call to block impossible actions.

Limit verbose planning in open models (shorter outputs) and require structured tool plans (<tool sequence> tags).
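The guardrail and structured-plan suggestions above can be sketched together as a pre-execution check. This is a minimal illustration under stated assumptions: the `<tool_sequence>` tag spelling, the comma-separated plan format, and the helper names `parse_plan` and `check_plan` are hypothetical, and the `UnsolvableQuery` name is taken from this summary without knowing its exact semantics in the benchmark.

```python
import re

# Guard name taken from the summary; treating it as a pseudo-tool is an assumption.
UNSOLVABLE = "UnsolvableQuery"


def parse_plan(model_output: str) -> list[str]:
    """Extract tool names from a hypothetical <tool_sequence>a, b</tool_sequence> block."""
    match = re.search(r"<tool_sequence>(.*?)</tool_sequence>", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no structured tool plan found")
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def check_plan(plan: list[str], available: set[str]) -> tuple[bool, list[str]]:
    """Return (ok_to_execute, unknown_tools) for a parsed plan."""
    unknown = [t for t in plan if t not in available and t != UNSOLVABLE]
    declared_unsolvable = UNSOLVABLE in plan
    ok = bool(plan) and not unknown and not declared_unsolvable
    return ok, unknown


# Illustrative toolset and model output.
available = {"pdf_reader", "calculator"}
output = "Plan: <tool_sequence>pdf_reader, translator</tool_sequence>"
ok, unknown = check_plan(parse_plan(output), available)
print(ok, unknown)  # False ['translator']
```

A check like this only catches non-existent tool names. It does not detect a plan that uses valid tools for a task they cannot solve, so it complements rather than replaces a solvability classifier.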

Agent Features

Planning
solution planning (decompose to subtasks), tool sequencing (Progress Rate metric)
Tool Use
API/tool selection, explicit 'UnsolvableQuery' guard
Frameworks
ReAct
Architectures
tool-augmented LLMs, MoE

Optimization Features

Infra Optimization
vLLM used for open-weight inference

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: MIT

Risks & Boundaries

Limitations

Limited model coverage: seven open-weight and seven proprietary models only.

Tool descriptions use only tool names (no API parameter tests), so results may not fully reflect API-driven tool use.

When Not To Use

To evaluate API-level tool correctness including parameter passing (ToolBH uses name-only tool descriptions).

As the sole benchmark for cross-lingual or domain-specific tool suites not represented in ToolBH.

Failure Modes

Solvability hallucination: model claims a task is solvable when it is not.

Non-existent tool prediction: calling tools not in the provided list.

Core Entities

Models

Gemini-1.5-Pro, GPT-4o, GPT-4-Turbo, GPT-4-0613, GPT-4-1106, GPT-3.5-Turbo, Gemini-1.0-Pro, Llama-3-70B, Llama-3-8B, Llama-2-70B, Llama-2-13B, Llama-2-7B, Mistral-7B, Mixtral-8x7B

Metrics

L1-EM (Exact Match solvability), L2-PR (Progress Rate for tool sequence), L3-PR (Progress Rate), L3-MS (Matching Score via embedding similarity), Overall score (percentage)
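For readers unfamiliar with these metric families, the sketch below shows one plausible reading of each. The summary does not give the paper's formulas, so the definitions here are assumptions: exact match on the solvability label, progress rate as the fraction of the reference sequence matched before the first wrong step, and matching score as cosine similarity between two embedding vectors. Function names are hypothetical.

```python
import math


def exact_match(predicted: bool, gold: bool) -> float:
    """1.0 when the predicted solvability label equals the reference label."""
    return 1.0 if predicted == gold else 0.0


def progress_rate(predicted: list[str], gold: list[str]) -> float:
    """Fraction of the gold sequence matched before the first wrong step (assumed definition)."""
    if not gold:
        return 0.0
    matched = 0
    for p, g in zip(predicted, gold):
        if p != g:
            break
        matched += 1
    return matched / len(gold)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Under this reading, a predicted plan `["a", "b", "x"]` against a reference `["a", "b", "c", "d"]` matches two of four steps before diverging. In practice the vectors passed to `cosine_similarity` would come from an embedding model, which is not specified here.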

Datasets

ToolBH (ToolBeHonest) benchmark

Benchmarks

AgentBench, ToolBench, AgentBoard, MetaTool, StableToolBench