Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
ToolEyes quantifies how well models actually use real APIs and multi-step tools, revealing that tool-specific fine-tuning and output-format controls matter more than raw model size for production tool integration.
Summary TLDR
This paper introduces ToolEyes, a fine-grained evaluation system for how well LLMs learn and use external tools in real-world tasks. ToolEyes defines seven realistic scenarios (text generation, data understanding, real-time search, application manipulation, personal life, information retrieval, financial transactions), a library of 568 tools, and five capability dimensions: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. The authors evaluate 10 LLMs (open-source, tool-oriented, closed-source). Key takeaways: closed-source models (GPT-4) lead but still score poorly on planning; tool-oriented fine-tuning helps; larger model size can worsen “f
Problem Statement
Current tool-use evaluations either require pre-specified tool-answer mappings or only check final outcomes. That misses core cognitive skills needed to use tools—formatting, understanding intent, planning multi-step actions, selecting correct tools/parameters, and organizing final answers—especially in messy real-world tool ecosystems.
Main Contribution
ToolEyes: a fine-grained evaluation system covering seven real-world scenarios and a 568-tool library.
A five-dimension rubric for tool learning: format alignment, intent comprehension, behavior planning, tool selection, answer organization.
An empirical study of 10 LLMs showing scenario preferences, weak planning skills, and cases where larger models perform worse for tool learning.
Key Findings
GPT-4 achieves the highest overall tool-learning score among tested models.
Tool-oriented fine-tuning substantially improves performance versus generic chat models.
Behavior planning is weak across models, even for the best model.
Increasing model size sometimes worsens tool-learning performance due to format and behavioral issues.
Automated scoring using GPT-4 aligns well with humans.
Results
s_overall (GPT-4)
s_overall (ToolLLaMA-2-7B-v2)
s_overall (Vicuna-1.5-7B)
s_overall (LLaMA-2-chat-70B)
Average interaction turns (LLaMA-2-chat-7B)
Behavior planning (GPT-4)
Who Should Care
What To Try In 7 Days
Run ToolEyes (or a small slice) on your models to baseline real tool-use performance.
Add strict output-format enforcement (keywords/JSON) to avoid parser breakage.
Run supervised fine-tuning (SFT) on a small tool-oriented dataset (examples of tool calls + reasoning).
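One way to approach the output-format enforcement step is a small validator that rejects any model turn a parser could not consume. The sketch below assumes a ReAct-style layout (Thought / Action / Action Input) with a JSON object as the action input; the function name `check_format` and the exact error messages are illustrative and are not taken from ToolEyes.

```python
import json
import re
from typing import List

# Assumed layout: "Thought:", "Action:", "Action Input:" in that order,
# with nothing before "Thought:" and a JSON object as the action input.
REACT_PATTERN = re.compile(
    r"\AThought:\s*(?P<thought>.+?)\n"
    r"Action:\s*(?P<action>[^\n]+)\n"
    r"Action Input:\s*(?P<action_input>.+)\Z",
    re.DOTALL,
)


def check_format(output: str) -> List[str]:
    """Return format violations; an empty list means the turn is parseable."""
    text = output.strip()
    # First pass: every required keyword must be present.
    problems = [
        f"missing keyword: {keyword}"
        for keyword in ("Thought:", "Action:", "Action Input:")
        if keyword not in text
    ]
    if problems:
        return problems
    match = REACT_PATTERN.match(text)
    if match is None:
        return ["keywords out of order or extra text before 'Thought:'"]
    # Redundant sentences after the JSON payload make json.loads fail.
    try:
        args = json.loads(match.group("action_input"))
    except json.JSONDecodeError:
        return ["Action Input is not valid JSON (possibly trailing sentences)"]
    if not isinstance(args, dict):
        return ["Action Input must be a JSON object"]
    return []
```

A caller could re-prompt the model whenever the returned list is non-empty, rather than passing a malformed turn to the tool executor.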
Agent Features
Memory
- short-term interaction turns (multi-turn state)
Planning
- behavior planning (multi-step planning and summarization)
Tool Use
- tool selection
- function calling
- parameter filling
Frameworks
- ReAct output format (Thought/Action/Action Input)
Is Agentic
true
Architectures
- Transformer LLMs (LLaMA/Vicuna/GPT series)
Optimization Features
Training Optimization
- SFT
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- No new model is proposed — the work is an evaluation system, not a tool-learning model.
- Scoring relies partly on GPT-4 as an automated judge to keep costs down; human validation was run only on a sample covering a subset of the models.
When Not To Use
- As the only metric when you need provable, deterministic tool-call correctness for safety-critical apps.
- If you require evaluation on tools or domains not covered by the 568-tool library without extending it first.
Failure Modes
- Format alignment breaks: redundant sentences or missing keywords stop tool parsing.
- Tool hallucinations: models invent tool/parameter names or add escape characters.
- Parameter hallucination: models invent API keys or required parameters.
- Scaling pitfalls: larger models may amplify conversational habits that harm structured outputs.
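The tool and parameter hallucination modes above can be caught before execution by checking each proposed call against a registry of known tools. This is a minimal sketch under assumptions: the registry contents, the tool names, and the `check_tool_call` helper are hypothetical and do not come from the 568-tool library.

```python
from typing import Dict, List, Set

# Hypothetical registry: tool name -> required and optional parameter names.
TOOL_REGISTRY: Dict[str, Dict[str, Set[str]]] = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
    "search_news": {"required": {"query"}, "optional": {"limit"}},
}


def check_tool_call(name: str, args: dict, registry=TOOL_REGISTRY) -> List[str]:
    """Return hallucination issues for one proposed tool call."""
    spec = registry.get(name)
    if spec is None:
        # Covers invented tools and names mangled by stray escape characters.
        return [f"unknown tool: {name!r}"]
    issues = []
    allowed = spec["required"] | spec["optional"]
    # Parameters the tool does not define (for example an invented API key).
    for param in sorted(set(args) - allowed):
        issues.append(f"unknown parameter: {param!r}")
    # Required parameters the model left out.
    for param in sorted(spec["required"] - set(args)):
        issues.append(f"missing required parameter: {param!r}")
    return issues
```

For example, `check_tool_call("get_weather", {"api_key": "x"})` reports both the invented `api_key` parameter and the missing `city`.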
Core Entities
Models
- LLaMA-2-chat-7B
- LLaMA-2-chat-13B
- LLaMA-2-chat-70B
- Vicuna-1.5-7B
- Vicuna-1.5-13B
- ToolLLaMA-2-7B-v1
- ToolLLaMA-2-7B-v2
- Text-davinci-003
- GPT-3.5-turbo
- GPT-4
Metrics
- s_overall
- s_FA (format alignment)
- s_IC (intent comprehension)
- s_b-validity
- s_b-integrity
- s_t-reality
- s_t-match
- s_a-pass
- s_a-quality
Datasets
- ToolEyes dataset (382 queries across 7 scenarios, human-crafted)
Benchmarks
- ToolEyes
Context Entities
Models
- ToolLLaMA (tool-oriented fine-tuned LLaMA)
- Vicuna (instruction-following fine-tuned LLaMA variants)
Metrics
- Welch's ANOVA for scenario variance
- Human-GPT-4 agreement %
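As a rough illustration of the human-GPT-4 agreement metric, the sketch below computes the share of items on which a human rater and an automated judge assign the same label. It assumes simple exact-match agreement over paired labels; the paper's actual protocol may differ, and the function name and sample labels are hypothetical.

```python
from typing import Sequence


def agreement_percent(human: Sequence, auto: Sequence) -> float:
    """Percentage of items where human and automated labels match exactly."""
    if len(human) != len(auto):
        raise ValueError("label sequences must have the same length")
    if not human:
        raise ValueError("at least one labeled item is required")
    matches = sum(h == a for h, a in zip(human, auto))
    return 100.0 * matches / len(human)


# Hypothetical pass/fail labels for four sampled responses.
human_labels = [1, 0, 1, 1]
gpt4_labels = [1, 0, 0, 1]
print(agreement_percent(human_labels, gpt4_labels))  # 75.0
```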
Datasets
- Tool learning datasets from prior work (e.g., ToolLLaMA training data referenced)
Benchmarks
- API-Bank, MetaTool, ToolBench variants (compared in Appendix A / Table 5)

