ToolEyes: a 7-scenario, 568-tool evaluation that measures five concrete tool-learning skills

January 1, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The methodology uses a large, human-crafted tool library and queries, with automated GPT-4 scoring validated against humans; results are reliable for comparative benchmarking but limited by the chosen models and GPT-4 evaluation costs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ToolEyes quantifies how well models actually use real APIs and multi-step tools, revealing that tool-specific fine-tuning and output-format controls matter more than raw model size for production tool integration.

Who Should Care

Summary TLDR

This paper introduces ToolEyes, a fine-grained evaluation system for how well LLMs learn and use external tools in real-world tasks. ToolEyes defines seven realistic scenarios (text generation, data understanding, real-time search, application manipulation, personal life, information retrieval, financial transactions), a library of 568 tools, and five capability dimensions: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. The authors evaluate 10 LLMs (open-source, tool-oriented, closed-source). Key takeaways: closed-source models (GPT-4) lead but still score poorly on planning; tool-oriented fine-tuning helps; larger model size can worsen “f

Problem Statement

Current tool-use evaluations either require pre-specified tool-answer mappings or only check final outcomes. That misses core cognitive skills needed to use tools—formatting, understanding intent, planning multi-step actions, selecting correct tools/parameters, and organizing final answers—especially in messy real-world tool ecosystems.

Main Contribution

ToolEyes: a fine-grained evaluation system covering seven real-world scenarios and a 568-tool library.

A five-dimension rubric for tool learning: format alignment, intent comprehension, behavior planning, tool selection, answer organization.

Key Findings

GPT-4 achieves the highest overall tool-learning score among tested models.

Numbers: s_overall = 70.31% (Table 2)

Practical Use: Expect the best off-the-shelf closed-source models to lead in tool usage today; use them as baselines when evaluating your own tool-integration work.

Evidence Ref: Table 2

Tool-oriented fine-tuning substantially improves performance versus generic chat models.

Numbers: ToolLLaMA-2-7B-v2 s_overall = 56.30% vs LLaMA-2-chat-7B 13.59%

Practical Use: If you need reliable tool use, invest in tool-specific fine-tuning or curated tool-use datasets rather than only scaling general chat models.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
s_overall (GPT-4) | 70.31% | - | - | All scenarios (Table 2) | GPT-4 overall score from Table 2 | Table 2
s_overall (ToolLLaMA-2-7B-v2) | 56.30% | LLaMA-2-chat-7B 13.59% | +42.71 pp vs LLaMA-2-chat-7B | All scenarios (Table 2) | Tool-oriented fine-tuning gains (Table 2) | Table 2
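
The delta reported for ToolLLaMA-2-7B-v2 is a difference in percentage points, not a relative change. A minimal sketch of that arithmetic, using the two s_overall values quoted from Table 2 (the function name is illustrative, not part of ToolEyes):

```python
def percentage_point_delta(value_pct: float, baseline_pct: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    # Round to two decimals to match the precision of the reported scores.
    return round(value_pct - baseline_pct, 2)


# s_overall values quoted from Table 2 of the paper.
delta = percentage_point_delta(56.30, 13.59)
print(f"{delta:+.2f} pp")
```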

What To Try In 7 Days

Run ToolEyes (or a small slice) on your models to baseline real tool-use performance.

Add strict output-format enforcement (keywords/JSON) to avoid parser breakage.

Fine-tune or SFT on a small tool-oriented dataset (examples of tool calls + reasoning).
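
Strict output-format enforcement can be prototyped as a small validator that rejects any model turn the tool runner could not parse. The sketch below assumes a ReAct-style turn with the keywords Thought, Action, and Action Input, and assumes the Action Input is a JSON object; the regular expression and the function name `parse_react_step` are hypothetical and are not taken from the ToolEyes code.

```python
import json
import re

# Assumed layout of one ReAct turn: the three keywords appear in this order,
# each starting its own line, with nothing before "Thought:".
REACT_STEP = re.compile(
    r"\AThought:[ \t]*(?P<thought>.+?)\n"
    r"Action:[ \t]*(?P<action>[^\n]+)\n"
    r"Action Input:[ \t]*(?P<action_input>.+)\Z",
    re.DOTALL,
)


def parse_react_step(text: str) -> dict:
    """Return the parsed turn, or raise ValueError describing the format break."""
    match = REACT_STEP.match(text.strip())
    if match is None:
        raise ValueError("missing or misordered Thought/Action/Action Input keywords")
    try:
        # Trailing chatter after the JSON object also fails here ("Extra data").
        arguments = json.loads(match.group("action_input"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"Action Input is not valid JSON: {exc}") from exc
    if not isinstance(arguments, dict):
        raise ValueError("Action Input must be a JSON object")
    return {
        "thought": match.group("thought").strip(),
        "action": match.group("action").strip(),
        "arguments": arguments,
    }
```

A turn such as `'Thought: I need the weather.\nAction: get_weather\nAction Input: {"city": "Paris"}'` parses cleanly, while a turn that opens with a redundant sentence before `Thought:` raises `ValueError`. On failure, a caller could re-prompt the model with the error message rather than pass a malformed call to the tool.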

Agent Features

Memory: short-term interaction turns (multi-turn state)

Planning: behavior planning (multi-step planning and summarization)

Tool Use: tool selection, function calling, parameter filling

Frameworks: ReAct output format (Thought/Action/Action Input)

Is Agentic: Yes

Architectures: Transformer LLMs (LLaMA/Vicuna/GPT series)

Optimization Features

Training Optimization: SFT

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

No new model is proposed — the work is an evaluation system, not a tool-learning model.

Scoring relies partly on GPT-4 due to cost; only a subset of models was evaluated, with human validation on a sample.

When Not To Use

As the only metric when you need provable, deterministic tool-call correctness for safety-critical apps.

If you require evaluation on tools or domains not covered by the 568-tool library without extending it first.

Failure Modes

Format alignment breaks: redundant sentences or missing keywords stop tool parsing.

Tool hallucinations: models invent tool/parameter names or add escape characters.
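
Tool hallucinations of this kind can be caught before execution by checking each proposed call against the set of tools that actually exist. The following is a minimal sketch under stated assumptions: `ToolSpec`, `check_tool_call`, and the example registry are hypothetical names for illustration, not the ToolEyes data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one tool: its name and accepted parameters."""
    name: str
    required: frozenset = field(default_factory=frozenset)
    optional: frozenset = field(default_factory=frozenset)


def check_tool_call(registry: dict, tool_name: str, arguments: dict) -> list:
    """Return a list of problems; an empty list means the call looks well-formed."""
    spec = registry.get(tool_name)
    if spec is None:
        # Covers invented tools and names corrupted by stray escape characters.
        return [f"unknown tool: {tool_name!r}"]
    problems = []
    allowed = spec.required | spec.optional
    for key in sorted(arguments):
        if key not in allowed:
            problems.append(f"unknown parameter: {key!r}")
    for key in sorted(spec.required - set(arguments)):
        problems.append(f"missing required parameter: {key!r}")
    return problems


# Illustrative registry with a single made-up tool.
registry = {
    "get_weather": ToolSpec("get_weather", frozenset({"city"}), frozenset({"units"})),
}
print(check_tool_call(registry, "get_weather", {"location": "Paris"}))
```

Exact-match lookup is a deliberate choice here: a name like `get\_weather` (with an added escape character) is reported as unknown instead of being silently repaired, which keeps the failure visible in evaluation logs.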

Core Entities

Models

LLaMA-2-chat-7B, LLaMA-2-chat-13B, LLaMA-2-chat-70B, Vicuna-1.5-7B, Vicuna-1.5-13B, ToolLLaMA-2-7B-v1, ToolLLaMA-2-7B-v2, Text-davinci-003, GPT-3.5-turbo, GPT-4

Metrics

s_overall, s_FA (format alignment), s_IC (intent comprehension), s_b-validity, s_b-integrity, s_t-reality, s_t-match, s_a-pass, s_a-quality

Datasets

ToolEyes dataset (382 queries across 7 scenarios, human-crafted)

Benchmarks

ToolEyes

Context Entities

Models

ToolLLaMA (tool-oriented fine-tuned LLaMA), Vicuna (instruction-following fine-tuned LLaMA variants)

Metrics

Welch's ANOVA for scenario variance, Human-GPT-4 agreement %

Datasets

Tool learning datasets from prior work (e.g., ToolLLaMA training data referenced)

Benchmarks

API-Bank, MetaTool, ToolBench variants (compared in Appendix A / Table 5)