Overview
The methodology pairs a large, human-crafted tool library and query set with automated GPT-4 scoring that was validated against human judgments. Results are reliable for comparative benchmarking, but they are limited by the set of models chosen and by the cost of GPT-4 evaluation.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
ToolEyes quantifies how well models actually use real APIs and multi-step tools, revealing that tool-specific fine-tuning and output-format controls matter more than raw model size for production tool integration.
Who Should Care
Summary TLDR
This paper introduces ToolEyes, a fine-grained evaluation system for how well LLMs learn and use external tools in real-world tasks. ToolEyes defines seven realistic scenarios (text generation, data understanding, real-time search, application manipulation, personal life, information retrieval, financial transactions), a library of 568 tools, and five capability dimensions: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. The authors evaluate 10 LLMs (open-source, tool-oriented, closed-source). Key takeaways: closed-source models (GPT-4) lead but still score poorly on planning; tool-oriented fine-tuning helps; larger model size can worsen “f
Problem Statement
Current tool-use evaluations either require pre-specified tool-answer mappings or only check final outcomes. That misses core cognitive skills needed to use tools—formatting, understanding intent, planning multi-step actions, selecting correct tools/parameters, and organizing final answers—especially in messy real-world tool ecosystems.
Main Contribution
ToolEyes: a fine-grained evaluation system covering seven real-world scenarios and a 568-tool library.
A five-dimension rubric for tool learning: format alignment, intent comprehension, behavior planning, tool selection, answer organization.
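As an illustration of how the five-dimension rubric could be represented and rolled up into a single number, here is a minimal sketch. It assumes an unweighted mean over the five dimensions; this summary does not state how s_overall is aggregated, so treat the aggregation rule, the identifier names, and the example values as assumptions rather than the authors' method.

```python
from statistics import mean

# The five capability dimensions named in the rubric.
DIMENSIONS = (
    "format_alignment",
    "intent_comprehension",
    "behavior_planning",
    "tool_selection",
    "answer_organization",
)

def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the five dimension scores (assumed aggregation)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise KeyError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)

# Placeholder values for illustration only, not results from the paper.
example = {d: 50.0 for d in DIMENSIONS}
print(overall_score(example))
```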
Key Findings
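GPT-4 achieves the highest overall tool-learning score among tested models (70.31%, Table 2).
Tool-oriented fine-tuning substantially improves performance over generic chat models: ToolLLaMA-2-7B-v2 scores 56.30% overall versus 13.59% for LLaMA-2-chat-7B (Table 2).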
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| s_overall (GPT-4) | 70.31% | — | — | All scenarios (Table 2) | GPT-4 overall score from Table 2 | Table 2 |
| s_overall (ToolLLaMA-2-7B-v2) | 56.30% | LLaMA-2-chat-7B 13.59% | +42.71 pp vs LLaMA-2-chat-7B | All scenarios (Table 2) | Tool-oriented fine-tuning gains (Table 2) | Table 2 |
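
The Delta column is an absolute difference in percentage points (pp), not a relative change. A short sketch that reproduces the arithmetic from the two Table 2 values above; the function name is illustrative.

```python
def delta_pp(score: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to 2 decimals."""
    return round(score - baseline, 2)

tool_llama = 56.30  # s_overall, ToolLLaMA-2-7B-v2 (Table 2)
llama_chat = 13.59  # s_overall, LLaMA-2-chat-7B (Table 2)

gain = delta_pp(tool_llama, llama_chat)
print(f"+{gain:.2f} pp")  # prints "+42.71 pp"
```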
What To Try In 7 Days
Run ToolEyes (or a small slice) on your models to baseline real tool-use performance.
Add strict output-format enforcement (keywords/JSON) to avoid parser breakage.
Fine-tune or SFT on a small tool-oriented dataset (examples of tool calls + reasoning).
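One way to approach the strict output-format step is to reject any model reply that is not exactly one well-formed JSON tool call before it reaches the tool parser. The sketch below assumes a hypothetical two-key schema (`tool`, `arguments`); it is not the format ToolEyes uses, and the function name is illustrative.

```python
import json

# Hypothetical schema for illustration; adapt to your own tool-call format.
REQUIRED_KEYS = {"tool", "arguments"}

def parse_tool_call(raw: str) -> dict:
    """Parse a model reply that must be exactly one JSON tool call.

    Raises ValueError for replies a strict downstream parser would reject,
    such as extra sentences around the JSON or missing keys.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    missing = REQUIRED_KEYS - call.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(call["tool"], str) or not isinstance(call["arguments"], dict):
        raise ValueError("'tool' must be a string and 'arguments' an object")
    return call

# A clean reply parses; a reply with a chatty prefix is rejected.
print(parse_tool_call('{"tool": "search", "arguments": {"query": "weather"}}'))
try:
    parse_tool_call('Sure! {"tool": "search", "arguments": {}}')
except ValueError as err:
    print("rejected:", err)
```

Rejecting and retrying on a `ValueError` keeps malformed replies out of the execution path, at the cost of an extra model call.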
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Training Optimization
Reproducibility
Risks & Boundaries
Limitations
No new model is proposed — the work is an evaluation system, not a tool-learning model.
Scoring relies partly on GPT-4 because of cost; only a subset of models was checked against a human validation sample.
When Not To Use
As the only metric when you need provable, deterministic tool-call correctness for safety-critical apps.
If you require evaluation on tools or domains not covered by the 568-tool library without extending it first.
Failure Modes
Format alignment breaks: redundant sentences or missing keywords stop tool parsing.
Tool hallucinations: models invent tool/parameter names or add escape characters.
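Both failure modes can be caught before a call is executed. The sketch below checks a proposed call against a registry of known tools and parameters and flags stray escape characters in the tool name. The registry contents and function name are hypothetical, chosen only to illustrate the check.

```python
# Hypothetical registry: tool name -> allowed parameter names.
TOOL_REGISTRY = {
    "get_weather": {"city", "units"},
    "search_web": {"query"},
}

def find_hallucinations(tool: str, arguments: dict) -> list[str]:
    """Return a list of problems with a proposed tool call (empty if none)."""
    if "\\" in tool:
        # e.g. a model emitting "get\_weather" instead of "get_weather"
        return [f"tool name contains escape characters: {tool!r}"]
    if tool not in TOOL_REGISTRY:
        return [f"unknown tool: {tool!r}"]
    unknown = sorted(set(arguments) - TOOL_REGISTRY[tool])
    if unknown:
        return [f"unknown parameters for {tool!r}: {unknown}"]
    return []

print(find_hallucinations("get_weather", {"city": "Paris"}))  # prints []
print(find_hallucinations("get_wether", {}))
print(find_hallucinations("search_web", {"q": "tool learning"}))
```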

