ToolBH: a multi-level benchmark that finds tool-use hallucinations in LLMs

Overview

Decision Snapshot: Needs Validation

Well-executed benchmark with 700 curated samples and 14 models tested; findings are robust for diagnosing tool-use hallucinations but are limited to the tested models and tool formats.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 55%

Authors

Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, Hayato Yamana

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Tool-augmented LLMs can call wrong or non-existent tools and often overestimate solvability; that risks incorrect automation, unsafe commands, and wasted API calls in production.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

ToolBH is a diagnostic benchmark that tests how LLMs hallucinate when asked to use external tools. It defines three evaluation depths: solvability detection (L1), solution planning (L2), and missing-tool analysis (L3). It also defines three toolset scenarios that induce hallucination: missing necessary tools, potential (hidden) tools, and limited-function tools. The authors build 700 curated samples (7 subtasks × (50 solvable + 50 unsolvable)), run 14 models (7 proprietary, 7 open-weight), and show that even state-of-the-art models struggle: the best overall score was 45.3% (Gemini-1.5-Pro), and open-weight models lag especially on unsolvable cases. Main failure modes are solvability hallucination, predicting non-existent tools, …

Problem Statement

When LLMs are asked to use external tools, they can hallucinate—calling tools that don't exist, misjudging whether a task is solvable with the given tools, or planning incorrect tool sequences. Existing tool benchmarks assume a complete tool list and miss these real-world failure modes. We need a diagnostic benchmark that exposes why LLMs hallucinate in tool-augmented settings.

Main Contribution

ToolBH: a multi-level diagnostic benchmark (L1 solvability, L2 planning, L3 missing-tool analysis) for tool-augmented LLMs.

A breadth taxonomy of tool scenarios that induce hallucination: Missing Necessary Tools (MNT), Potential Tools (PT), and Limited Functionality Tools (LFT).
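To make the two axes concrete, here is a minimal sketch of how one benchmark sample could be represented. The field names, the example query, and the tool names are hypothetical and are not taken from the ToolBH data files; only the three scenario labels (MNT, PT, LFT), the three levels, and the `UnsolvableQuery` marker come from the summary above. Placing `UnsolvableQuery` inside the reference plan is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Scenario(Enum):
    """The three toolset scenarios named in the taxonomy."""
    MNT = "missing_necessary_tools"
    PT = "potential_tools"
    LFT = "limited_functionality_tools"


@dataclass(frozen=True)
class Sample:
    """Hypothetical layout of one sample; not the benchmark's real schema."""
    query: str
    available_tools: Tuple[str, ...]
    scenario: Scenario
    solvable: bool                            # L1 label: solvability
    reference_plan: Tuple[str, ...]           # L2 label: ordered tool names
    missing_tool_notes: Tuple[str, ...] = ()  # L3 label: what is missing


# An unsolvable MNT-style case: the email tool is absent from the toolset.
example = Sample(
    query="Extract the text from report.pdf and email it to Alice.",
    available_tools=("pdf_to_text",),
    scenario=Scenario.MNT,
    solvable=False,
    # Assumption: the unsolvable step is marked with the guard token.
    reference_plan=("pdf_to_text", "UnsolvableQuery"),
    missing_tool_notes=("a tool that sends email",),
)
```

Keeping the three labels on one record means a single sample can be scored at all three depths without re-joining separate files.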

Key Findings

Top proprietary models still perform poorly on tool-hallucination tasks.

Numbers: Gemini-1.5-Pro overall score = 45.3%, GPT-4o = 37.0% (Table 2)

Practical Use: Do not assume a high-quality LLM will reliably choose or describe tools; validate tool plans with a diagnostic step before execution.

Evidence Ref: Table 2

Open-weight models underperform on unsolvable tasks relative to proprietary models.

Numbers: On unsolvable tasks, open-weight models reach 39.4% of proprietary models' performance (Sec. 5.3, Table 3)

Practical Use: If you need robust behavior under incomplete toolsets, prefer validated proprietary models or retrain open models with targeted unsolvability examples.

Evidence Ref: Sec. 5.3 / Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall score (ToolBH, higher is better)	Gemini-1.5-Pro: 45.3%, GPT-4o: 37.0%, Llama-3-70B: 14.6%	—	—	ToolBH (all scenarios)	Table 2 reports overall percentage scores across 14 models	Table 2
L1-EM accuracy (solvability exact match, higher is better)	Gemini-1.5-Pro: 62.7%, GPT-4-0613: 59.3%, Llama-3-70B: 31.3%	—	—	ToolBH (L1)	L1-EM column in Table 2	Table 2

What To Try In 7 Days

Run ToolBH or a subset on your internal toolset to find solvability-hallucination cases.

Add a solvability classifier or guardrail before any tool call to block impossible actions.

Limit verbose planning in open models (shorter outputs) and require structured tool plans (<tool sequence> tags).
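The guardrail step suggested above can be sketched as a pre-execution check on a structured tool plan. This is an illustrative sketch, not code from the ToolBH repository: `check_plan` and its return keys are hypothetical names, and the plan is assumed to have already been parsed into a list of tool names. The `UnsolvableQuery` marker is the guard mentioned in this summary.

```python
from typing import Dict, List, Sequence, Union

# Guard token a model can emit to say that a step cannot be solved.
UNSOLVABLE = "UnsolvableQuery"


def check_plan(
    plan: Sequence[str], available_tools: Sequence[str]
) -> Dict[str, Union[List[str], bool]]:
    """Flag plans that call unknown tools or that declare the task unsolvable.

    Hypothetical helper: it checks tool names only, not arguments.
    """
    allowed = set(available_tools)
    # Tools the model named that are not in the provided toolset.
    unknown = [t for t in plan if t != UNSOLVABLE and t not in allowed]
    declared_unsolvable = UNSOLVABLE in plan
    return {
        "unknown_tools": unknown,
        "declared_unsolvable": declared_unsolvable,
        # Execute only non-empty plans made entirely of known tools.
        "safe_to_execute": bool(plan) and not unknown and not declared_unsolvable,
    }


report = check_plan(["search_flights", "book_hotel"], ["search_flights"])
print(report["unknown_tools"])    # ['book_hotel']
print(report["safe_to_execute"])  # False
```

A check like this only catches non-existent tool names. It does not catch a model that wrongly believes a task is solvable with the listed tools, so it complements a solvability classifier rather than replacing one.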

Agent Features

Planning

solution planning (decompose to subtasks), tool sequencing (Progress Rate metric)

Tool Use

API/tool selection, explicit 'UnsolvableQuery' guard

Frameworks

ReAct

Architectures

tool-augmented LLMs, MoE

Optimization Features

Infra Optimization

vLLM used for open-weight inference

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: MIT

Code URLs

https://github.com/ToolBeHonest/ToolBeHonest

Data URLs

https://github.com/ToolBeHonest/ToolBeHonest

Risks & Boundaries

Limitations

Limited model coverage: seven open-weight and seven proprietary models only.

Tool descriptions use only tool names (no API parameter tests), so results may not fully reflect API-driven tool use.

When Not To Use

To evaluate API-level tool correctness including parameter passing (ToolBH uses name-only tool descriptions).

As the sole benchmark for cross-lingual or domain-specific tool suites not represented in ToolBH.

Failure Modes

Solvability hallucination: model claims a task is solvable when it is not.

Non-existent tool prediction: calling tools not in the provided list.

Core Entities

Models

Gemini-1.5-Pro, GPT-4o, GPT-4-Turbo, GPT-4-0613, GPT-4-1106, GPT-3.5-Turbo, Gemini-1.0-Pro, Llama-3-70B, Llama-3-8B, Llama-2-70B, Llama-2-13B, Llama-2-7B, Mistral-7B, Mixtral-8x7B

Metrics

L1-EM (Exact Match solvability)

L2-PR (Progress Rate for tool sequence)

L3-PR (Progress Rate)

L3-MS (Matching Score via embedding similarity)

Overall score (percentage)
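As a rough illustration of the two simplest metrics, the sketch below computes an exact-match rate over solvability labels and a prefix-based progress rate over tool sequences. The progress-rate definition used here (the share of the reference sequence matched in order before the first mismatch) is an assumption made for illustration; the paper's exact formula may differ, and the embedding-based matching score (L3-MS) is not covered. Function names are hypothetical.

```python
from typing import Sequence


def exact_match(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Share of samples whose predicted solvability label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def progress_rate(predicted_plan: Sequence[str], reference_plan: Sequence[str]) -> float:
    """Assumed definition: matched prefix length divided by reference length."""
    if not reference_plan:
        return 0.0
    matched = 0
    for pred_tool, ref_tool in zip(predicted_plan, reference_plan):
        if pred_tool != ref_tool:
            break  # stop counting at the first divergence
        matched += 1
    return matched / len(reference_plan)


print(exact_match([True, False, True, True], [True, False, False, True]))  # 0.75
print(progress_rate(["a", "b", "x"], ["a", "b", "c", "d"]))                # 0.5
```

Under this reading, a plan that starts correctly but later calls a wrong or non-existent tool still earns partial credit for the steps it got right.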

Datasets

ToolBH (ToolBeHonest) benchmark

Benchmarks

AgentBenchToolBenchAgentBoardMetaToolStableToolBench

