T-Eval: a stepwise benchmark that breaks LLM tool use into six measurable abilities

December 21, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark offers practical, actionable diagnostics and a human-verified dataset; it is ready for model analysis but not a turnkey safety or deployment test.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 45%

Authors

Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, Feng Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

T-Eval gives runnable, per-skill diagnostics for building LLM-based tool agents so teams can pinpoint whether problems come from planning, choosing tools, formatting requests, or checking results.

Who Should Care

Summary TLDR

T-Eval is a new benchmark that evaluates how well LLMs act as tool agents by splitting tool use into six concrete abilities: PLAN, REASON, RETRIEVE, UNDERSTAND, INSTRUCT, and REVIEW. The authors build a human-verified dataset (23,305 test cases from ~553 annotated queries), provide per-ability metrics (string and strict JSON formats), and show fine-grained gaps missed by holistic tests. Results highlight that GPT-4 leads overall (86.4), open-source models improve with scale (e.g., Qwen-72B 71.4), and common weaknesses are format following, retrieval, and review. The code and benchmark are available at the project repo.

Problem Statement

Current tool-use benchmarks judge only final outputs or single API calls. That hides which internal skills (planning, selecting tools, building parameters, checking results) fail in multi-step, real-world tool usage. We need a stable, fine-grained test that isolates each sub-skill and reduces variance from live APIs.

Main Contribution

A step-by-step benchmark (T-Eval) that decomposes tool utilization into six measurable abilities: INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW.

A multi-agent data generation pipeline plus human verification to produce golden solution paths and tool responses, yielding 23,305 test cases across the six abilities.

Key Findings

Top commercial models lead overall tool-use performance.

Numbers: GPT-4 overall 86.4; GPT-3.5 84.0; Claude2 78.8

Practical Use: For production tool agents, start with API models (GPT-4/3.5) when possible; they need less specialized tuning to get reliable multi-step tool behavior.

Evidence Ref: Table 1 (main results)

Open-source models improve with scale but still trail best API models.

Numbers: Qwen-7B 59.5 → Qwen-72B 71.4 overall

Practical Use: Scaling open models helps, but expect ~10–20 point gaps vs top APIs; plan additional tuning or validation before deploying open models as agents.

Evidence Ref: Fig. 3 and Table 1
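
The point gaps discussed in these findings follow directly from the overall scores. A minimal sketch of the arithmetic, using only the numbers quoted in this summary (the dictionary and variable names are illustrative, not part of T-Eval):

```python
# Overall T-Eval scores as quoted in this summary (Table 1).
overall = {
    "GPT-4": 86.4,
    "GPT-3.5": 84.0,
    "Claude2": 78.8,
    "Qwen-72B": 71.4,
    "Qwen-7B": 59.5,
}

# Model with the highest overall score.
best_model = max(overall, key=overall.get)

# Gap of each model to the best score, rounded to one decimal place
# to avoid floating-point noise in the subtraction.
gaps = {m: round(overall[best_model] - s, 1) for m, s in overall.items()}

# Gain from scaling Qwen from 7B to 72B.
qwen_scale_gain = round(overall["Qwen-72B"] - overall["Qwen-7B"], 1)

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:>9}: {overall[model]:5.1f}  (gap to {best_model}: {gap:4.1f})")
print(f"Qwen 7B -> 72B gain: {qwen_scale_gain}")
```

With these inputs, Qwen-72B sits 15.0 points below GPT-4 and Qwen-7B sits 26.9 points below, while scaling Qwen from 7B to 72B recovers 11.9 points.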

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall score (selected models) | GPT-4 86.4; GPT-3.5 84.0; Claude2 78.8; Qwen-72B 71.4 | - | - | T-Eval overall average | Table 1 overall column | Table 1
Review ability (classification of tool responses) | GPT-4 94.5; GPT-3.5 75.6; many open models 50–63 | Review best = GPT-4 | GPT-4 − open models ≈ 30–45 pts | REVIEW subset (choice) | Table 1 REVIEW column | Table 1

What To Try In 7 Days

Run T-Eval subsets on your model to see if failures are format-related (INSTRUCT) or functional (RETRIEVE/REVIEW).

If JSON outputs fail, add a format-repair layer or fine-tune on format-specific examples.

Prioritize training data and retrieval supervision before scaling model size to reduce retrieval and review gaps.
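
A format-repair layer of the kind suggested above can be quite small. The following is a minimal sketch, assuming the model is supposed to emit a single JSON object for each tool call; the function name `repair_json` and the specific repair steps are illustrative assumptions, not something T-Eval prescribes:

```python
import json
import re
from typing import Any, Optional


def repair_json(raw: str) -> Optional[Any]:
    """Best-effort recovery of JSON from model output; returns None on failure."""
    # 1. Try the text exactly as the model produced it.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # 2. Keep only the span from the first "{" to the last "}",
    #    which drops surrounding chatter and markdown code fences.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]

    # 3. Drop trailing commas before a closing brace or bracket.
    #    Caveat: this is a crude regex and would also alter a string
    #    value that happens to contain ",}" or ",]".
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


# Hypothetical model output: chatter, a code fence, and trailing commas.
messy = 'Sure! ```json\n{"name": "search", "args": {"q": "x",},}\n``` done'
print(repair_json(messy))
```

Logging how often step 2 or 3 is needed gives a rough measure of how much of a strict-format failure rate is formatting rather than wrong content.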

Agent Features

Planning
PLAN measured via action-sequence similarity and ordering
Plans evaluated as ordered action lists
Tool Use
INSTRUCT (formatting tool calls)
RETRIEVE (choose tool)
UNDERSTAND (fill parameters)
REVIEW (judge response)
Frameworks
ReAct (used for end-to-end agent evaluation)
Architectures
multi-agent annotation pipeline (planner/executor/reviewer)
Collaboration
multi-agent pipeline for data annotation (separate planner/executor/reviewer roles)
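
The PLAN entry above describes scoring plans as ordered action lists compared by sequence similarity. One way to illustrate that idea is an F1 score over the longest common subsequence (LCS) of predicted and reference actions. This is a sketch under that assumption, not T-Eval's exact metric, and the tool names are hypothetical:

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (order-preserving overlap)."""
    dp: List[List[int]] = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def plan_score(pred: Sequence[str], gold: Sequence[str]) -> float:
    """F1 over the LCS: rewards correct actions that appear in the correct order."""
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tool names, for illustration only.
gold_plan = ["search_flights", "book_flight", "send_email"]
swapped_plan = ["book_flight", "search_flights", "send_email"]

print(plan_score(gold_plan, gold_plan))     # identical plans
print(plan_score(swapped_plan, gold_plan))  # right tools, wrong order
```

Because the LCS preserves order, a plan that uses the right tools in the wrong sequence scores below a perfect match, which is the behavior an ordering-aware plan metric needs.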

Optimization Features

Training Optimization
human-in-loop refinement for instruction generation
multi-agent annotation to reduce annotation errors

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Constructed tool documentation is synthetic and fixed, so results do not capture failures caused by live API instability or temporal changes.

The string and strict JSON protocols can each under- or over-estimate ability, depending on whether format-following is a priority for your use case.
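
The effect of protocol choice can be illustrated with a toy comparison. The sketch below assumes a strict check that requires parsable JSON with a matching `name` field and a lenient check that only looks for the tool name in the text; both checks, the field name, and the sample outputs are hypothetical and much simpler than T-Eval's own scoring:

```python
import json


def strict_hit(output: str, gold_tool: str) -> bool:
    """Strict protocol (assumed): output must be a JSON object naming the gold tool."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(call, dict) and call.get("name") == gold_tool


def lenient_hit(output: str, gold_tool: str) -> bool:
    """Lenient string protocol (assumed): the gold tool name appears anywhere."""
    return gold_tool in output


# Hypothetical model outputs for one query whose gold tool is "get_weather".
outputs = [
    '{"name": "get_weather", "args": {"city": "Paris"}}',  # right tool, valid JSON
    "I will call get_weather with city=Paris.",            # right tool, no JSON
    '{"name": "get_news", "args": {}}',                    # wrong tool, valid JSON
]
gold = "get_weather"

strict_acc = sum(strict_hit(o, gold) for o in outputs) / len(outputs)
lenient_acc = sum(lenient_hit(o, gold) for o in outputs) / len(outputs)

print(f"strict:  {strict_acc:.2f}")
print(f"lenient: {lenient_acc:.2f}")
```

Here the second output picks the right tool but fails the strict check purely on format, so the two protocols disagree on the same behavior. Which number is the better estimate depends on whether the downstream system can tolerate free-form tool calls.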

When Not To Use

When you need evaluation against live, changing external APIs or real-time web state.

When you require safety or adversarial testing beyond format and selection (e.g., prompt injection stress tests).

Failure Modes

Format parsing failures (models produce unparsable JSON) that artificially lower strict scores.

Wrong tool choice (RETRIEVE errors) even when plan and reasoning are correct.

Core Entities

Models

GPT-4, gpt-3.5-turbo, Claude2, Qwen-72B, Qwen-14B, Qwen-7B, LLaMA2-7B, LLaMA2-13B, LLaMA2-70B, Baichuan2-7B, Baichuan2-13B, Mistral-7B, Vicuna-7B, Vicuna-13B, WizardLM-13B, WizardLM-70B, CodeLLaMA-7B, InternLM-7B, AgentLM-7B, ChatGLM3-6B

Metrics

Per-ability scores (INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW); Accuracy; End-to-end win rate (ToolBench-style comparison)

Datasets

T-Eval (23,305 test cases)

Benchmarks

T-Eval

Context Entities

Models

GPT-3.5 (gpt-3.5-turbo-16k), GPT-4 (gpt-4-1106-preview)

Metrics

Win rate judged by GPT-4 (used for cross-check)

Datasets

ToolBench, ToolQA, API-Bank

Benchmarks

ToolBench (win rate comparison used)