ToolBH: a multi-level benchmark that finds tool-use hallucinations in LLMs

June 28, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.55

Cost Impact Score

0.4

Citation Count

0

Authors

Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, Hayato Yamana

Why It Matters For Business

Tool-augmented LLMs can call wrong or non-existent tools and often overestimate solvability; that risks incorrect automation, unsafe commands, and wasted API calls in production.

Summary TLDR

ToolBH is a diagnostic benchmark that tests how LLMs hallucinate when asked to use external tools. It defines three evaluation depths: solvability detection (L1), solution planning (L2), and missing-tool analysis (L3). It also defines three toolset scenarios that induce hallucination: missing necessary tools, potential (hidden) tools, and limited-function tools. The authors build 700 curated samples (7 subtasks, each with 50 solvable and 50 unsolvable samples) and evaluate 14 models (7 proprietary, 7 open-weight). Even state-of-the-art models struggle: the best overall score is 45.3% (Gemini-1.5-Pro), and open-weight models lag especially on unsolvable cases. Main failure modes include solvability hallucination and predicting non-existent tools.

Problem Statement

When LLMs are asked to use external tools, they can hallucinate—calling tools that don't exist, misjudging whether a task is solvable with the given tools, or planning incorrect tool sequences. Existing tool benchmarks assume a complete tool list and miss these real-world failure modes. We need a diagnostic benchmark that exposes why LLMs hallucinate in tool-augmented settings.

Main Contribution

ToolBH: a multi-level diagnostic benchmark (L1 solvability, L2 planning, L3 missing-tool analysis) for tool-augmented LLMs.

A breadth taxonomy of tool scenarios that induce hallucination: Missing Necessary Tools (MNT), Potential Tools (PT), and Limited Functionality Tools (LFT).

A curated dataset of 700 samples (7 subtasks × 100 samples) with human-in-the-loop generation and filtering; evaluation metrics and code released under MIT.

Large-scale evaluation of 14 LLMs (7 proprietary, 7 open-weight) with detailed error analysis and practical failure modes.

Key Findings

Top proprietary models still perform poorly on tool-hallucination tasks.

Numbers: Gemini-1.5-Pro overall score = 45.3%, GPT-4o = 37.0% (Table 2)

Open-weight models underperform on unsolvable tasks relative to proprietary models.

Numbers: On unsolvable tasks, open-weight models reach 39.4% of proprietary models' performance (Sec. 5.3, Table 3)

The dominant error type is 'solvability hallucination'—models assert solvability when tools are insufficient.

Response length impacts models differently: open-weight models degrade with verbosity; proprietary models often benefit from longer reasoning.

Results

Overall score (ToolBH, higher is better)

Value: Gemini-1.5-Pro 45.3%, GPT-4o 37.0%, Llama-3-70B 14.6%

Accuracy

Value: GPT-4-0613 59.3%, Gemini-1.5-Pro 62.7%, Llama-3-70B 31.3%

Level-3 matching score (L3-MS)

Value: Gemini-1.5-Pro on MNT 36.6%, GPT-4o on MNT 24.8% (example values)

Open-weight vs proprietary on unsolvable tasks

Value: Open-weight = 39.4% of proprietary performance on unsolvable cases

Who Should Care

What To Try In 7 Days

Run ToolBH or a subset on your internal toolset to find solvability-hallucination cases.

Add a solvability classifier or guardrail before any tool call to block impossible actions.

Limit verbose planning in open-weight models (shorter outputs) and require structured tool plans (<tool sequence> tags).
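The guardrail idea above can be sketched as a check that runs before any tool call. This is a minimal illustration, not something taken from the ToolBH code: the names `PlanCheck` and `check_plan` and the example tools are hypothetical, and the sketch assumes the model's plan has already been parsed into a list of tool names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PlanCheck:
    """Result of checking a proposed tool plan against the real toolset."""
    allowed: bool
    unknown_tools: list[str] = field(default_factory=list)


def check_plan(plan: list[str], available_tools: set[str]) -> PlanCheck:
    """Reject any plan that names a tool outside the provided toolset.

    Hypothetical helper: catches non-existent tool predictions before execution.
    """
    # dict.fromkeys keeps first-seen order while dropping duplicate names.
    unknown = [t for t in dict.fromkeys(plan) if t not in available_tools]
    return PlanCheck(allowed=not unknown, unknown_tools=unknown)


# Example toolset and plan (both invented for illustration).
available = {"search_flights", "book_hotel"}
result = check_plan(["search_flights", "send_email", "book_hotel"], available)
print(result.allowed, result.unknown_tools)
```

A check like this only catches tools that are missing by name. It does not detect a listed tool whose functionality is too limited for the task, which would need a separate solvability judgment.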

Agent Features

Planning

  • solution planning (decompose to subtasks)
  • tool sequencing (Progress Rate metric)

Tool Use

  • API/tool selection
  • explicit 'UnsolvableQuery' guard
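
One way to act on an explicit 'UnsolvableQuery' guard is to treat it as a sentinel in the tool plan that halts execution. The sketch below is an assumption about how such a guard could be wired into an executor; only the name 'UnsolvableQuery' comes from this summary, while `run_plan`, the status strings, and the example tool are hypothetical.

```python
UNSOLVABLE = "UnsolvableQuery"  # name from the summary; handling below is assumed


def run_plan(plan, tools):
    """Run tools in order and stop once the plan declares the task unsolvable.

    `plan` is a list of tool names; `tools` maps names to zero-argument callables.
    """
    outputs = []
    for name in plan:
        if name == UNSOLVABLE:
            # The model signalled that the remaining subgoal cannot be solved.
            return {"status": "unsolvable", "outputs": outputs}
        if name not in tools:
            # The model named a tool that was never provided.
            return {"status": "unknown_tool", "tool": name, "outputs": outputs}
        outputs.append(tools[name]())
    return {"status": "done", "outputs": outputs}


# Invented example: one real tool, then the sentinel.
tools = {"get_weather": lambda: "sunny"}
outcome = run_plan(["get_weather", "UnsolvableQuery"], tools)
print(outcome)
```

Returning the partial outputs alongside the status lets a caller report what was completed before the plan stopped.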

Frameworks

  • ReAct

Architectures

  • tool-augmented LLMs
  • MoE

Optimization Features

Infra Optimization

  • vLLM used for open-weight inference

Reproducibility

License

  • MIT

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Limited model coverage: seven open-weight and seven proprietary models only.
  • Tool descriptions use only tool names (no API parameter tests), so results may not fully reflect API-driven tool use.
  • Annotation team limited to one region; potential cultural or regional bias in sample design.

When Not To Use

  • To evaluate API-level tool correctness including parameter passing (ToolBH uses name-only tool descriptions).
  • As the sole benchmark for cross-lingual or domain-specific tool suites not represented in ToolBH.

Failure Modes

  • Solvability hallucination: model claims a task is solvable when it is not.
  • Non-existent tool prediction: calling tools not in the provided list.
  • Wrong tool reasoning: selecting tools in the wrong order or for wrong subgoals.
  • Long-text forgetting: verbose outputs that lose track of unsolvable subgoals.

Core Entities

Models

  • Gemini-1.5-Pro
  • GPT-4o
  • GPT-4-Turbo
  • GPT-4-0613
  • GPT-4-1106
  • GPT-3.5-Turbo
  • Gemini-1.0-Pro
  • Llama-3-70B
  • Llama-3-8B
  • Llama-2-70B
  • Llama-2-13B
  • Llama-2-7B
  • Mistral-7B
  • Mixtral-8x7B

Metrics

  • L1-EM (Exact Match solvability)
  • L2-PR (Progress Rate for tool sequence)
  • L3-PR (Progress Rate)
  • L3-MS (Matching Score via embedding similarity)
  • Overall score (percentage)
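
To make the metric names above concrete, here is a rough sketch of how scores of this kind could be computed. This summary does not give the paper's formulas, so the definitions are assumptions: exact match compares normalized labels, progress rate counts reference steps matched in order before the first mismatch, and the matching score is a cosine similarity between two embedding vectors produced elsewhere.

```python
import math


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the predicted solvability label equals the reference, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def progress_rate(pred: list, gold: list) -> float:
    """Fraction of the reference tool sequence matched before the first mismatch.

    Assumed definition for illustration; the paper's exact formula may differ.
    """
    if not gold:
        return 0.0
    matched = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        matched += 1
    return matched / len(gold)


def cosine_similarity(u: list, v: list) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented example: two of four reference steps are matched in order.
pr = progress_rate(["a", "b", "x"], ["a", "b", "c", "d"])
print(pr)
```

In practice the embedding vectors for a matching score would come from a sentence-embedding model; plain lists of floats stand in for them here.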

Datasets

  • ToolBH (ToolBeHonest) benchmark

Benchmarks

  • AgentBench
  • ToolBench
  • AgentBoard
  • MetaTool
  • StableToolBench