ToolEyes: a 7-scenario, 568-tool evaluation that measures five concrete tool-learning skills

Overview

Decision Snapshot: Needs Validation

The methodology uses a large, human-crafted tool library and queries, with automated GPT-4 scoring validated against humans; results are reliable for comparative benchmarking but limited by the chosen models and GPT-4 evaluation costs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ToolEyes quantifies how well models actually use real APIs and multi-step tools, revealing that tool-specific fine-tuning and output-format controls matter more than raw model size for production tool integration.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist, Engineering Lead

Summary TLDR

This paper introduces ToolEyes, a fine-grained evaluation system for how well LLMs learn and use external tools in real-world tasks. ToolEyes defines seven realistic scenarios (text generation, data understanding, real-time search, application manipulation, personal life, information retrieval, financial transactions), a library of 568 tools, and five capability dimensions: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. The authors evaluate 10 LLMs (open-source, tool-oriented, closed-source). Key takeaways: closed-source models (GPT-4) lead but still score poorly on planning; tool-oriented fine-tuning helps; larger model size can worsen “f

Problem Statement

Current tool-use evaluations either require pre-specified tool-answer mappings or only check final outcomes. That misses core cognitive skills needed to use tools—formatting, understanding intent, planning multi-step actions, selecting correct tools/parameters, and organizing final answers—especially in messy real-world tool ecosystems.

Main Contribution

ToolEyes: a fine-grained evaluation system covering seven real-world scenarios and a 568-tool library.

A five-dimension rubric for tool learning: format alignment, intent comprehension, behavior planning, tool selection, answer organization.

Key Findings

GPT-4 achieves the highest overall tool-learning score among tested models.

Numbers: s_overall = 70.31% (Table 2)

Practical Use: Expect the best off-the-shelf closed-source models to lead in tool usage today; use them as baselines when evaluating your own tool-integration work.

Evidence Ref: Table 2

Tool-oriented fine-tuning substantially improves performance versus generic chat models.

Numbers: ToolLLaMA-2-7B-v2 s_overall = 56.30% vs LLaMA-2-chat-7B 13.59%

Practical Use: If you need reliable tool use, invest in tool-specific fine-tuning or curated tool-use datasets rather than only scaling general chat models.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
s_overall (GPT-4)	70.31%	—	—	All scenarios (Table 2)	GPT-4 overall score from Table 2	Table 2
s_overall (ToolLLaMA-2-7B-v2)	56.30%	LLaMA-2-chat-7B 13.59%	+42.71 pp vs LLaMA-2-chat-7B	All scenarios (Table 2)	Tool-oriented fine-tuning gains (Table 2)	Table 2
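The Delta column is a difference in percentage points, not a relative change. A minimal sketch of that arithmetic, using only the two s_overall values reported above; the helper names `delta_pp` and `relative_ratio` are illustrative and do not come from the paper or its code.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Absolute gap between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 2)


def relative_ratio(value_pct: float, baseline_pct: float) -> float:
    """How many times larger the value is than the baseline."""
    return round(value_pct / baseline_pct, 2)


# s_overall values quoted from Table 2 in this summary.
TOOL_LLAMA_V2 = 56.30
LLAMA_CHAT_7B = 13.59
GPT_4 = 70.31

print(delta_pp(TOOL_LLAMA_V2, LLAMA_CHAT_7B))        # 42.71
print(relative_ratio(TOOL_LLAMA_V2, LLAMA_CHAT_7B))  # 4.14
print(delta_pp(GPT_4, TOOL_LLAMA_V2))                # 14.01
```

Reporting both numbers avoids the common mix-up between "+42.71 pp" and "+42.71%".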

What To Try In 7 Days

Run ToolEyes (or a small slice) on your models to baseline real tool-use performance.

Add strict output-format enforcement (keywords/JSON) to avoid parser breakage.

Fine-tune or SFT on a small tool-oriented dataset (examples of tool calls + reasoning).
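Strict output-format enforcement can be sketched as a parser that rejects anything other than one Thought/Action/Action Input step. This is an illustration under assumptions, not the ToolEyes parser: the regular expression, the requirement that Action Input be a JSON object, and the `get_weather` tool are all hypothetical.

```python
import json
import re

# Assumed layout: exactly one Thought, one Action, one JSON Action Input,
# with nothing before the Thought or after the closing brace.
REACT_STEP = re.compile(
    r"\AThought:\s*(?P<thought>.+?)\s*\n"
    r"Action:\s*(?P<action>[^\n]+?)\s*\n"
    r"Action Input:\s*(?P<action_input>\{.*\})\s*\Z",
    re.DOTALL,
)


def parse_react_step(text: str) -> dict:
    """Parse one model turn or raise ValueError describing the format break."""
    match = REACT_STEP.match(text.strip())
    if match is None:
        raise ValueError("output does not follow Thought/Action/Action Input")
    try:
        arguments = json.loads(match.group("action_input"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"Action Input is not valid JSON: {exc}") from exc
    return {
        "thought": match.group("thought"),
        "action": match.group("action"),
        "arguments": arguments,
    }


good = 'Thought: I need the weather.\nAction: get_weather\nAction Input: {"city": "Paris"}'
print(parse_react_step(good)["action"])  # get_weather
```

Raising on a trailing sentence or a missing keyword, instead of trying to repair the output, lets the caller retry the model and count the failure explicitly.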

Agent Features

Memory

short-term interaction turns (multi-turn state)

Planning

behavior planning (multi-step planning and summarization)

Tool Use

tool selection, function calling, parameter filling

Frameworks

ReAct output format (Thought/Action/Action Input)

Is Agentic

Yes

Architectures

Transformer LLMs (LLaMA/Vicuna/GPT series)

Optimization Features

Training Optimization

SFT

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Junjie-Ye/ToolEyes

Data URLs

https://github.com/Junjie-Ye/ToolEyes

Risks & Boundaries

Limitations

No new model is proposed — the work is an evaluation system, not a tool-learning model.

Scoring relies partly on GPT-4, whose cost limited evaluation to a subset of models; human validation covered only a sample.

When Not To Use

As the only metric when you need provable, deterministic tool-call correctness for safety-critical apps.

If you require evaluation on tools or domains not covered by the 568-tool library without extending it first.

Failure Modes

Format alignment breaks: redundant sentences or missing keywords stop tool parsing.

Tool hallucinations: models invent tool/parameter names or add escape characters.
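One way to catch the second failure mode before a call is executed is to check the proposed tool and parameter names against the known tool library. The sketch below assumes a hypothetical registry that maps each tool name to its allowed parameter names; the tools shown are invented for illustration and are not taken from the 568-tool library.

```python
# Hypothetical registry: tool name -> allowed parameter names.
TOOL_REGISTRY = {
    "get_weather": {"city", "unit"},
    "search_news": {"query"},
}


def check_tool_call(tool_name: str, arguments: dict, registry: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    if "\\" in tool_name:
        problems.append(f"escape character in tool name: {tool_name!r}")
    if tool_name not in registry:
        # Parameters cannot be checked for a tool that does not exist.
        problems.append(f"unknown tool: {tool_name!r}")
        return problems
    for param in sorted(arguments):
        if param not in registry[tool_name]:
            problems.append(f"unknown parameter for {tool_name}: {param!r}")
    return problems


print(check_tool_call("get_weather", {"city": "Paris"}, TOOL_REGISTRY))  # []
print(check_tool_call("get_weather", {"town": "Paris"}, TOOL_REGISTRY))
```

Returning every problem rather than stopping at the first makes it easier to log which kind of hallucination a model produces most often.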

Core Entities

Models

LLaMA-2-chat-7B, LLaMA-2-chat-13B, LLaMA-2-chat-70B, Vicuna-1.5-7B, Vicuna-1.5-13B, ToolLLaMA-2-7B-v1, ToolLLaMA-2-7B-v2, Text-davinci-003, GPT-3.5-turbo, GPT-4

Metrics

s_overall, s_FA (format alignment), s_IC (intent comprehension), s_b-validity, s_b-integrity, s_t-reality, s_t-match, s_a-pass, s_a-quality

Datasets

ToolEyes dataset (382 queries across 7 scenarios, human-crafted)

Benchmarks

ToolEyes

Context Entities

Models

ToolLLaMA (tool-oriented fine-tuned LLaMA), Vicuna (instruction-following fine-tuned LLaMA variants)

Metrics

Welch's ANOVA for scenario variance, Human-GPT-4 agreement %
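Both context metrics are standard statistics and can be sketched with the Python standard library. The code below follows the textbook definition of Welch's one-way ANOVA (groups weighted by n / variance) and a plain percent-agreement count. It is not the authors' analysis code, the numbers are toy data rather than ToolEyes results, and the p-value is omitted because it needs an F-distribution CDF that the standard library does not provide.

```python
from statistics import mean, variance


def welch_anova_f(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    Each group needs at least two values and a non-zero sample variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    weights = [n / variance(g) for n, g in zip(sizes, groups)]  # w_i = n_i / s_i^2
    total_w = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_w
    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


def agreement_pct(human_labels, judge_labels):
    """Percentage of items where the human and the automated judge agree."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return 100 * matches / len(human_labels)


# Toy per-scenario scores (illustrative only).
f_stat, df1, df2 = welch_anova_f([[1, 2, 3], [2, 4, 6]])
print(round(f_stat, 2), df1, round(df2, 2))  # 2.4 1 2.94
print(agreement_pct(["pass", "fail", "pass", "pass"], ["pass", "fail", "fail", "pass"]))  # 75.0
```

Welch's variant is the usual choice when groups such as scenarios have unequal variances or sizes, since it does not pool the variances the way classical ANOVA does.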

Datasets

Tool learning datasets from prior work (e.g., ToolLLaMA training data referenced)

Benchmarks

API-Bank, MetaTool, ToolBench variants (compared in Appendix A / Table 5)

