ToolEyes: a 7-scenario, 568-tool evaluation that measures five concrete tool-learning skills

January 1, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

1

Authors

Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang

Links

Abstract / PDF

Why It Matters For Business

ToolEyes quantifies how well models actually use real APIs and multi-step tools, revealing that tool-specific fine-tuning and output-format controls matter more than raw model size for production tool integration.

Summary TLDR

This paper introduces ToolEyes, a fine-grained evaluation system for how well LLMs learn and use external tools in real-world tasks. ToolEyes defines seven realistic scenarios (text generation, data understanding, real-time search, application manipulation, personal life, information retrieval, financial transactions), a library of 568 tools, and five capability dimensions: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. The authors evaluate 10 LLMs (open-source, tool-oriented, closed-source). Key takeaways: closed-source models (GPT-4) lead but still score poorly on planning; tool-oriented fine-tuning helps; larger model size can worsen “f

Problem Statement

Current tool-use evaluations either require pre-specified tool-answer mappings or only check final outcomes. That misses core cognitive skills needed to use tools—formatting, understanding intent, planning multi-step actions, selecting correct tools/parameters, and organizing final answers—especially in messy real-world tool ecosystems.

Main Contribution

ToolEyes: a fine-grained evaluation system covering seven real-world scenarios and a 568-tool library.

A five-dimension rubric for tool learning: format alignment, intent comprehension, behavior planning, tool selection, answer organization.

An empirical study of 10 LLMs showing scenario preferences, weak planning skills, and cases where larger models perform worse for tool learning.

Key Findings

GPT-4 achieves the highest overall tool-learning score among tested models.

Numbers: s_overall = 70.31% (Table 2)

Tool-oriented fine-tuning substantially improves performance versus generic chat models.

Numbers: ToolLLaMA-2-7B-v2 s_overall = 56.30% vs LLaMA-2-chat-7B 13.59%

Behavior planning is weak across models, even for the best model.

Numbers: GPT-4 behavior-planning ≈ 35.70% (Figure 5 / text)

Increasing model size sometimes worsens tool-learning performance due to format and behavioral issues.

Numbers: LLaMA-2-chat-70B s_overall = 5.29% and 91% format failures cited

Automated scoring using GPT-4 aligns well with humans.

Numbers: agreement >83.5% across dimensions (B.1)
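The agreement figure above is a match rate between human and GPT-4 judgments, reported per dimension. A minimal sketch of how such a rate could be computed, assuming one categorical label per sample from each judge. The function name, the record layout, and the sample labels are hypothetical and are not taken from the paper:

```python
from collections import defaultdict


def agreement_by_dimension(records):
    """Return percent agreement between human and model labels per dimension.

    `records` is an iterable of (dimension, human_label, model_label) tuples.
    This layout is an assumption for illustration, not the paper's data format.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for dimension, human_label, model_label in records:
        totals[dimension] += 1
        if human_label == model_label:
            matches[dimension] += 1
    return {dim: 100.0 * matches[dim] / totals[dim] for dim in totals}


# Hypothetical labels, made up purely to exercise the function.
sample = [
    ("format_alignment", 1, 1),
    ("format_alignment", 0, 0),
    ("format_alignment", 1, 0),
    ("format_alignment", 1, 1),
    ("tool_selection", 1, 1),
    ("tool_selection", 0, 0),
]
rates = agreement_by_dimension(sample)
```

Raw percent agreement does not correct for chance agreement, so it is best read alongside the label distribution of the validation sample.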

Results

s_overall (GPT-4)

Value: 70.31%

s_overall (ToolLLaMA-2-7B-v2)

Value: 56.30%

Baseline: LLaMA-2-chat-7B 13.59%

s_overall (Vicuna-1.5-7B)

Value: 38.76%

s_overall (LLaMA-2-chat-70B)

Value: 5.29%

Average interaction turns (LLaMA-2-chat-7B)

Value: 7.0 turns

Baseline: GPT-4 2.8 turns

Behavior planning (GPT-4)

Value: ≈35.70%

Who Should Care

What To Try In 7 Days

Run ToolEyes (or a small slice) on your models to baseline real tool-use performance.

Add strict output-format enforcement (keywords/JSON) to avoid parser breakage.

Fine-tune (SFT) on a small tool-oriented dataset (examples of tool calls plus reasoning).
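The output-format enforcement step can be prototyped as a strict parser that rejects any turn not matching the expected keywords. The sketch below assumes the ReAct-style Thought/Action/Action Input layout and a JSON object as the action input. The function name `parse_step` and the example tool `get_weather` are hypothetical, and this is not the ToolEyes scoring code:

```python
import json
import re

# Keywords follow the ReAct-style layout (Thought / Action / Action Input).
# Requiring a JSON object after "Action Input:" is an assumption of this sketch.
_STEP = re.compile(
    r"\AThought:\s*(?P<thought>.+?)\s*\n"
    r"Action:\s*(?P<action>[^\n]+?)\s*\n"
    r"Action Input:\s*(?P<action_input>\{.*\})\s*\Z",
    re.DOTALL,
)


def parse_step(text):
    """Parse one model turn into a dict, or raise ValueError with the reason."""
    match = _STEP.match(text.strip())
    if match is None:
        raise ValueError("missing, misordered, or trailing text around the keywords")
    try:
        arguments = json.loads(match.group("action_input"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"Action Input is not valid JSON: {exc}") from exc
    return {
        "thought": match.group("thought"),
        "action": match.group("action"),
        "arguments": arguments,
    }


# Hypothetical turn for an imaginary `get_weather` tool.
good = 'Thought: I need the weather.\nAction: get_weather\nAction Input: {"city": "Paris"}'
step = parse_step(good)
```

Rejecting a turn outright, rather than trying to salvage it, makes format failures visible in logs and gives a clear signal for a retry prompt. A redundant closing sentence after the JSON, for example, raises `ValueError` here instead of being silently dropped.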

Agent Features

Memory

  • short-term interaction turns (multi-turn state)

Planning

  • behavior planning (multi-step planning and summarization)

Tool Use

  • tool selection
  • function calling
  • parameter filling

Frameworks

  • ReAct output format (Thought/Action/Action Input)

Is Agentic

true

Architectures

  • Transformer LLMs (LLaMA/Vicuna/GPT series)

Optimization Features

Training Optimization

  • SFT

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • No new model is proposed — the work is an evaluation system, not a tool-learning model.
  • Scoring relies partly on GPT-4 as an automated judge to contain cost; human validation was run only on a sample, for a subset of the models.

When Not To Use

  • As the only metric when you need provable, deterministic tool-call correctness for safety-critical apps.
  • If you require evaluation on tools or domains not covered by the 568-tool library without extending it first.

Failure Modes

  • Format alignment breaks: redundant sentences or missing keywords stop tool parsing.
  • Tool hallucinations: models invent tool/parameter names or add escape characters.
  • Parameter hallucination: models invent API keys or required parameters.
  • Scaling pitfalls: larger models may amplify conversational habits that harm structured outputs.
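Invented tool or parameter names can be caught before execution by checking each proposed call against a registry of known tools. A minimal sketch under that assumption follows. `ToolSpec`, `check_call`, and the one-tool registry are hypothetical names introduced for illustration, not part of ToolEyes:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one tool: its name and parameter names."""
    name: str
    required: frozenset = field(default_factory=frozenset)
    optional: frozenset = field(default_factory=frozenset)


def check_call(registry, tool_name, arguments):
    """Return a list of problems; an empty list means the names all check out."""
    spec = registry.get(tool_name)
    if spec is None:
        return [f"unknown tool: {tool_name!r}"]
    problems = []
    supplied = set(arguments)
    for name in sorted(spec.required - supplied):
        problems.append(f"missing required parameter: {name!r}")
    for name in sorted(supplied - spec.required - spec.optional):
        problems.append(f"unknown parameter: {name!r}")
    return problems


# Hypothetical registry with a single imaginary tool.
registry = {
    "get_weather": ToolSpec("get_weather", frozenset({"city"}), frozenset({"units"})),
}
issues = check_call(registry, "get_weather", {"api_key": "x"})
```

This check only covers names. It does not detect an invented value, such as a made-up API key passed under a legitimate parameter name, which needs a separate validation step.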

Core Entities

Models

  • LLaMA-2-chat-7B
  • LLaMA-2-chat-13B
  • LLaMA-2-chat-70B
  • Vicuna-1.5-7B
  • Vicuna-1.5-13B
  • ToolLLaMA-2-7B-v1
  • ToolLLaMA-2-7B-v2
  • Text-davinci-003
  • GPT-3.5-turbo
  • GPT-4

Metrics

  • s_overall
  • s_FA (format alignment)
  • s_IC (intent comprehension)
  • s_b-validity
  • s_b-integrity
  • s_t-reality
  • s_t-match
  • s_a-pass
  • s_a-quality

Datasets

  • ToolEyes dataset (382 queries across 7 scenarios, human-crafted)

Benchmarks

  • ToolEyes

Context Entities

Models

  • ToolLLaMA (tool-oriented fine-tuned LLaMA)
  • Vicuna (instruction-following fine-tuned LLaMA variants)

Metrics

  • Welch's ANOVA for scenario variance
  • Human-GPT-4 agreement %
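Welch's ANOVA tests whether group means differ without assuming equal variances across groups, which suits per-scenario scores of uneven spread. The sketch below computes the standard Welch F statistic and its degrees of freedom in plain Python. How the paper grouped its scores is not stated here, so the function name and input layout are assumptions:

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA over lists of scores."""
    k = len(groups)
    if k < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least two groups with two or more values each")
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    variances = [variance(g) for g in groups]  # sample variance (n - 1)
    if any(v == 0 for v in variances):
        raise ValueError("each group needs non-zero variance")

    # Each group is weighted by n / s^2, so noisy groups count for less.
    weights = [n / v for n, v in zip(sizes, variances)]
    total_weight = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2
```

The p-value is omitted because it requires the F-distribution survival function; with SciPy installed, `scipy.stats.f.sf(f_stat, df1, df2)` would supply it.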

Datasets

  • Tool learning datasets from prior work (e.g., ToolLLaMA training data referenced)

Benchmarks

  • API-Bank, MetaTool, ToolBench variants (compared in Appendix A / Table 5)