Overview
Production Readiness
0.7
Novelty Score
0.45
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
T-Eval gives runnable, per-skill diagnostics for building LLM-based tool agents so teams can pinpoint whether problems come from planning, choosing tools, formatting requests, or checking results.
Summary TLDR
T-Eval is a benchmark that evaluates how well LLMs act as tool agents by splitting tool use into six concrete abilities: PLAN, REASON, RETRIEVE, UNDERSTAND, INSTRUCT, and REVIEW. The authors build a human-verified dataset (23,305 test cases from ~553 annotated queries), provide per-ability metrics (in both string and strict JSON formats), and expose fine-grained gaps that holistic tests miss. Results show that GPT-4 leads overall (86.4), that open-source models improve with scale (e.g., Qwen-72B at 71.4), and that the common weaknesses are format following, retrieval, and review. The code and benchmark are available at the project repo.
Problem Statement
Current tool-use benchmarks judge only final outputs or single API calls. That hides which internal skills (planning, selecting tools, building parameters, checking results) fail in multi-step, real-world tool usage. We need a stable, fine-grained test that isolates each sub-skill and reduces variance from live APIs.
Main Contribution
A step-by-step benchmark (T-Eval) that decomposes tool utilization into six measurable abilities: INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW.
A multi-agent data generation pipeline plus human verification to produce golden solution paths and tool responses, yielding 23,305 test cases across the six abilities.
Extensive evaluation across 20 LLMs showing per-ability strengths and bottlenecks and validating consistency with holistic win-rate evaluations.
Key Findings
Top commercial models lead overall tool-use performance.
Open-source models improve with scale but still trail the best API models.
JSON-format (strict) output is a major choke point for many models.
Tool retrieval and review are common weakness areas.
Evaluating under both the lenient string format and the strict JSON format separates genuine capability gaps from format-following failures.
Results
Overall score (selected models)
Review ability (classification of tool responses)
Retrieval (tool selection) JSON
Instruction following INSTRUCT (string vs JSON examples)
Who Should Care
What To Try In 7 Days
Run T-Eval subsets on your model to see if failures are format-related (INSTRUCT) or functional (RETRIEVE/REVIEW).
If JSON outputs fail, add a format-repair layer or fine-tune on format-specific examples.
To reduce retrieval and review gaps, prioritize training data and retrieval supervision before scaling model size.
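The format-repair layer suggested above could look something like the following. This is a minimal sketch, not part of T-Eval: the function name `repair_json` and the specific repairs (stripping code-fence markers, trimming surrounding chatter, dropping trailing commas) are assumptions about what typical malformed replies look like.

```python
import json
import re


def repair_json(raw: str):
    """Try to recover a JSON value from a model reply; return None on failure."""
    # 1. Happy path: the reply is already valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Strip Markdown code-fence markers (three backticks, optional "json" tag).
    text = re.sub(r"`{3}(?:json)?", "", raw).strip()
    # 3. Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]
    # 4. Drop trailing commas before a closing brace or bracket.
    #    Caveat: this is a blunt regex and would also alter ",}" inside strings.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Comparing strict-format scores with and without such a layer gives a rough sense of how much of a model's deficit comes from formatting rather than from choosing the wrong tool or arguments.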
Agent Features
Planning
- PLAN measured via action-sequence similarity and ordering
- plans evaluated as ordered action lists
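To make "action-sequence similarity and ordering" concrete, here is one possible order-aware score. It is a simplified stand-in, not T-Eval's scoring code: it assumes a plan is a list of tool names, matches actions by exact name, and uses the longest common subsequence (LCS) so that out-of-order actions are penalized. The names `lcs_length` and `plan_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def plan_f1(pred, gold):
    """Order-aware F1 between a predicted and a reference action list."""
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)  # actions that agree *in order*
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["search_flights", "book_flight", "send_email"]
pred = ["book_flight", "search_flights", "send_email"]
score = plan_f1(pred, gold)  # two of three actions are in a consistent order
```

Exact name matching is the crudest choice here; a softer similarity between action descriptions would tolerate paraphrased steps at the cost of more moving parts.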
Tool Use
- INSTRUCT (formatting tool calls)
- RETRIEVE (choose tool)
- UNDERSTAND (fill parameters)
- REVIEW (judge response)
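The split between formatting, tool choice, and parameter filling can be illustrated with a small scorer for a single strict-JSON tool call. This is a sketch under stated assumptions: the `{"name": ..., "args": {...}}` schema, the function `score_tool_call`, and the exact-match argument scoring are all hypothetical and are not taken from the T-Eval code.

```python
import json


def score_tool_call(raw_reply: str, gold: dict) -> dict:
    """Score one reply against a reference call on three separate axes."""
    result = {"parsed": False, "tool_ok": False, "arg_score": 0.0}
    # Format axis (INSTRUCT-like): does the reply parse as a JSON object?
    try:
        call = json.loads(raw_reply)
    except json.JSONDecodeError:
        return result
    if not isinstance(call, dict):
        return result
    result["parsed"] = True
    # Selection axis (RETRIEVE-like): is the chosen tool the reference tool?
    result["tool_ok"] = call.get("name") == gold["name"]
    if not result["tool_ok"]:
        return result
    # Parameter axis (UNDERSTAND-like): fraction of reference args reproduced.
    gold_args = gold.get("args", {})
    pred_args = call.get("args", {})
    if not isinstance(pred_args, dict):
        pred_args = {}
    if gold_args:
        hits = sum(1 for k, v in gold_args.items() if pred_args.get(k) == v)
        result["arg_score"] = hits / len(gold_args)
    else:
        result["arg_score"] = 1.0 if not pred_args else 0.0
    return result


gold = {"name": "get_weather", "args": {"city": "Paris", "unit": "C"}}
reply = '{"name": "get_weather", "args": {"city": "Paris", "unit": "F"}}'
report = score_tool_call(reply, gold)
```

Keeping the three axes separate is the point: a reply that fails to parse says nothing about tool choice, so it should not be counted as a retrieval error.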
Frameworks
- ReAct (used for end-to-end agent evaluation)
Architectures
- multi-agent annotation pipeline (planner/executor/reviewer)
Collaboration
- multi-agent pipeline for data annotation (separate planner/executor/reviewer roles)
Optimization Features
Training Optimization
- human-in-the-loop refinement for instruction generation
- multi-agent annotation to reduce annotation errors
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Constructed tool documentation is synthetic and fixed, so results do not capture failures caused by live API instability or temporal changes.
- Inclusive string vs JSON protocols can under- or over-estimate ability depending on whether format-following is a priority.
- Annotation scale is limited (~553 annotated queries → 23,305 cases) and may not cover every domain or rare tool behavior.
When Not To Use
- When you need evaluation against live, changing external APIs or real-time web state.
- When you require safety or adversarial testing beyond format and selection (e.g., prompt injection stress tests).
- When your deployment uses tools with undocumented behaviors not represented in T-Eval docs.
Failure Modes
- Format parsing failures (models produce unparsable JSON) that artificially lower strict scores.
- Wrong tool choice (RETRIEVE errors) even when plan and reasoning are correct.
- Incorrect review classification (mislabeling tool success vs error) leading to incorrect stop/continue decisions.
Core Entities
Models
- GPT-4
- gpt-3.5-turbo
- Claude2
- Qwen-72B
- Qwen-14B
- Qwen-7B
- LLaMA2-7B
- LLaMA2-13B
- LLaMA2-70B
- Baichuan2-7B
- Baichuan2-13B
- Mistral-7B
- Vicuna-7B
- Vicuna-13B
- WizardLM-13B
- WizardLM-70B
- CodeLLaMA-7B
- InternLM-7B
- AgentLM-7B
- ChatGLM3-6B
Metrics
- Per-ability scores (INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW)
- Accuracy
- End-to-end win rate (ToolBench-style comparison)
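As a rough illustration of how these metrics combine, the sketch below aggregates per-ability scores into one number and computes a win rate from pairwise judgements. It assumes an unweighted mean over the six abilities and a simple wins-over-comparisons ratio; both are assumptions made for illustration, and the numbers are placeholders, not results from the paper.

```python
from statistics import mean

ABILITIES = ("INSTRUCT", "PLAN", "REASON", "RETRIEVE", "UNDERSTAND", "REVIEW")


def overall_score(per_ability: dict) -> float:
    """Unweighted mean over the six abilities (assumed aggregation)."""
    missing = [a for a in ABILITIES if a not in per_ability]
    if missing:
        raise ValueError(f"missing abilities: {missing}")
    return mean(per_ability[a] for a in ABILITIES)


def win_rate(wins: list) -> float:
    """Fraction of pairwise comparisons the model won (True = win)."""
    if not wins:
        raise ValueError("no comparisons given")
    return sum(wins) / len(wins)


# Placeholder scores for a hypothetical model, on a 0-100 scale.
scores = {"INSTRUCT": 80.0, "PLAN": 70.0, "REASON": 60.0,
          "RETRIEVE": 50.0, "UNDERSTAND": 40.0, "REVIEW": 30.0}
overall = overall_score(scores)
rate = win_rate([True, True, False, True])
```

Reporting the per-ability breakdown alongside the aggregate matters more than the aggregate itself, since two models with the same mean can have very different bottlenecks.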
Datasets
- T-Eval (23,305 test cases)
Benchmarks
- T-Eval
Context Entities
Models
- GPT-3.5 (gpt-3.5-turbo-16k)
- GPT-4 (gpt-4-1106-preview)
Metrics
- Win rate judged by GPT-4 (used for cross-check)
Datasets
- ToolBench
- ToolQA
- API-Bank
Benchmarks
- ToolBench (win rate comparison used)

