T-Eval: a stepwise benchmark that breaks LLM tool use into six measurable abilities

December 21, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.45

Cost Impact Score

0.6

Citation Count

3

Authors

Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, Feng Zhao

Links

Abstract / PDF

Why It Matters For Business

T-Eval gives runnable, per-skill diagnostics for building LLM-based tool agents so teams can pinpoint whether problems come from planning, choosing tools, formatting requests, or checking results.

Summary TLDR

T-Eval is a new benchmark that evaluates how well LLMs act as tool agents by splitting tool use into six concrete abilities: PLAN, REASON, RETRIEVE, UNDERSTAND, INSTRUCT, and REVIEW. The authors build a human-verified dataset (23,305 test cases from ~553 annotated queries), provide per-ability metrics (string and strict JSON formats), and show fine-grained gaps missed by holistic tests. Results highlight that GPT-4 leads overall (86.4), open-source models improve with scale (e.g., Qwen-72B 71.4), and common weaknesses are format following, retrieval, and review. The code and benchmark are available at the project repo.

Problem Statement

Current tool-use benchmarks judge only final outputs or single API calls. That hides which internal skills (planning, selecting tools, building parameters, checking results) fail in multi-step, real-world tool usage. We need a stable, fine-grained test that isolates each sub-skill and reduces variance from live APIs.

Main Contribution

A step-by-step benchmark (T-Eval) that decomposes tool utilization into six measurable abilities: INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW.

A multi-agent data generation pipeline plus human verification to produce golden solution paths and tool responses, yielding 23,305 test cases across the six abilities.

Extensive evaluation across 20 LLMs showing per-ability strengths and bottlenecks and validating consistency with holistic win-rate evaluations.

Key Findings

Top commercial models lead overall tool-use performance.

Numbers: GPT-4 overall 86.4; GPT-3.5 84.0; Claude2 78.8

Open-source models improve with scale but still trail best API models.

Numbers: Qwen-7B 59.5 → Qwen-72B 71.4 overall

JSON-format (strict) output is a major choke point for many models.

Numbers: Qwen-72B UNDERSTAND, string 84.5 vs JSON 66.1 (one example); many models drop more than 20 points on JSON

Tool retrieval and review are common weakness areas.

Numbers: Qwen-72B RETRIEVE (JSON) 65.0 vs GPT-3.5 86.2; GPT-4 REVIEW 94.5 vs 50–60 for many models

Inclusive, multi-difficulty evaluation separates genuine capability gaps from format failures.

Numbers: The dataset has both string and JSON protocols; some models score high on string but low on JSON (e.g., Baichuan2-13B plan:

Results

Overall score (selected models)

Value: GPT-4 86.4; GPT-3.5 84.0; Claude2 78.8; Qwen-72B 71.4

Review ability (classification of tool responses)

Value: GPT-4 94.5; GPT-3.5 75.6; many open models 50–63

Baseline: GPT-4 (best on review)

Retrieval (tool selection) JSON

Value: GPT-3.5 86.2; Qwen-72B 65.0; Qwen-14B 55.3

Baseline: GPT-3.5

Instruction following (INSTRUCT), string vs JSON examples

Value: GPT-3.5 string 94.1 / JSON 99.1; many open models drop heavily on JSON

Baseline: GPT-3.5

Who Should Care

What To Try In 7 Days

Run T-Eval subsets on your model to see if failures are format-related (INSTRUCT) or functional (RETRIEVE/REVIEW).

If JSON outputs fail, add a format-repair layer or fine-tune on format-specific examples.

To reduce retrieval and review gaps, prioritize better training data and retrieval supervision before scaling model size.
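
The format-repair layer suggested above could look something like the following. This is a minimal sketch using only the Python standard library, not part of T-Eval itself; the function name `repair_json` and the specific repair steps (stripping a Markdown fence, trimming surrounding chatter, dropping trailing commas) are assumptions chosen for illustration.

```python
import json
import re
from typing import Any, Optional

# Markdown code-fence marker, built programmatically to keep this listing simple.
FENCE = "`" * 3


def repair_json(raw: str) -> Optional[Any]:
    """Try to recover a JSON value from a model reply; return None on failure.

    Hypothetical helper: the repair heuristics below are illustrative only.
    """
    # 1. Happy path: the reply is already valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # 2. If the model wrapped its answer in a Markdown fence, look inside it.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else raw

    # 3. Keep only the outermost {...} span, dropping any chatter around it.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = candidate[start : end + 1]

    # 4. Remove trailing commas before a closing brace or bracket.
    #    Caveat: this is a blunt regex and would also alter ",}" inside strings.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

A repair layer like this changes what the strict JSON score measures, so it is probably worth reporting scores both with and without it.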

Agent Features

Planning

  • PLAN measured via action-sequence similarity and ordering
  • plans evaluated as ordered action lists
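
To make "action-sequence similarity and ordering" concrete, here is one possible way to score an ordered plan against a golden plan. It is a stand-in for illustration, not necessarily the metric T-Eval implements: it uses the longest common subsequence (LCS) of action names, so an action counts only if it appears in the right relative order, and combines precision and recall into an F1 value. The names `lcs_length` and `plan_score` are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two action lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def plan_score(predicted: List[str], golden: List[str]) -> float:
    """Order-aware F1 between a predicted plan and a golden plan (assumed metric)."""
    if not predicted or not golden:
        return 0.0
    matched = lcs_length(predicted, golden)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)  # share of predicted steps that are in order
    recall = matched / len(golden)        # share of golden steps that were recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    golden = ["search_flights", "book_flight", "send_email"]
    swapped = ["search_flights", "send_email", "book_flight"]
    print(plan_score(golden, golden))
    print(plan_score(swapped, golden))
```

With this scoring, a plan that contains every golden action but swaps two steps is penalized, which a plain set-overlap measure would miss.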

Tool Use

  • INSTRUCT (formatting tool calls)
  • RETRIEVE (choose tool)
  • UNDERSTAND (fill parameters)
  • REVIEW (judge response)
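
The split between choosing a tool (RETRIEVE) and filling its parameters (UNDERSTAND) can be illustrated by scoring the two parts of a single tool call separately. The sketch below is an assumption-laden illustration, not T-Eval's scoring code: the call schema with `name` and `args` keys, the exact-match rule, and the function `score_tool_call` are all hypothetical.

```python
from typing import Any, Dict


def score_tool_call(predicted: Dict[str, Any], golden: Dict[str, Any]) -> Dict[str, float]:
    """Score tool choice and argument filling independently.

    Assumes a hypothetical call schema: {"name": str, "args": {param: value}}.
    """
    # RETRIEVE-style check: was the right tool chosen?
    retrieve = 1.0 if predicted.get("name") == golden.get("name") else 0.0

    # UNDERSTAND-style check: what fraction of golden arguments were reproduced?
    gold_args = golden.get("args", {})
    pred_args = predicted.get("args", {})
    if not gold_args:
        # A tool with no golden arguments is only correct if none were supplied.
        understand = 1.0 if not pred_args else 0.0
    else:
        hits = sum(1 for k, v in gold_args.items() if k in pred_args and pred_args[k] == v)
        understand = hits / len(gold_args)

    return {"retrieve": retrieve, "understand": understand}


if __name__ == "__main__":
    golden = {"name": "get_weather", "args": {"city": "Paris", "unit": "celsius"}}
    predicted = {"name": "get_weather", "args": {"city": "Paris", "unit": "fahrenheit"}}
    print(score_tool_call(predicted, golden))
```

Keeping the two scores apart shows whether a failed call came from picking the wrong tool or from filling the right tool's parameters incorrectly.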

Frameworks

  • ReAct (used for end-to-end agent evaluation)

Architectures

  • multi-agent annotation pipeline (planner/executor/reviewer)

Collaboration

  • multi-agent pipeline for data annotation (separate planner/executor/reviewer roles)

Optimization Features

Training Optimization

  • human-in-loop refinement for instruction generation
  • multi-agent annotation to reduce annotation errors

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Constructed tool documentation is synthetic and fixed, so results do not capture failures caused by live API instability or temporal changes.
  • Inclusive string vs JSON protocols can under- or over-estimate ability depending on whether format-following is a priority.
  • Annotation scale is limited (~553 annotated queries → 23,305 cases) and may not cover every domain or rare tool behavior.

When Not To Use

  • When you need evaluation against live, changing external APIs or real-time web state.
  • When you require safety or adversarial testing beyond format and selection (e.g., prompt injection stress tests).
  • When your deployment uses tools with undocumented behaviors not represented in T-Eval docs.

Failure Modes

  • Format parsing failures (models produce unparsable JSON) that artificially lower strict scores.
  • Wrong tool choice (RETRIEVE errors) even when plan and reasoning are correct.
  • Incorrect review classification (mislabeling tool success vs error) leading to incorrect stop/continue decisions.
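
The first failure mode, where unparsable JSON lowers strict scores even though the model picked the right tool, can be seen by scoring the same reply under a lenient string check and a strict JSON check. This is a simplified sketch under stated assumptions: the `name` key, the substring-based string check, and both function names are hypothetical and much cruder than the benchmark's own protocols.

```python
import json


def strict_json_score(reply: str, golden_tool: str) -> float:
    """Strict protocol (assumed): the reply must parse as a JSON object naming the tool."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return 0.0  # unparsable output scores zero regardless of intent
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if parsed.get("name") == golden_tool else 0.0


def string_score(reply: str, golden_tool: str) -> float:
    """Lenient protocol (assumed): the tool name only has to appear in the text."""
    return 1.0 if golden_tool in reply else 0.0


if __name__ == "__main__":
    # The model chose the right tool but forgot the closing brace.
    reply = 'I will call {"name": "get_weather"'
    print(string_score(reply, "get_weather"), strict_json_score(reply, "get_weather"))
```

A large gap between the two scores on the same replies points to a format problem rather than a tool-selection problem.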

Core Entities

Models

  • GPT-4
  • gpt-3.5-turbo
  • Claude2
  • Qwen-72B
  • Qwen-14B
  • Qwen-7B
  • LLaMA2-7B
  • LLaMA2-13B
  • LLaMA2-70B
  • Baichuan2-7B
  • Baichuan2-13B
  • Mistral-7B
  • Vicuna-7B
  • Vicuna-13B
  • WizardLM-13B
  • WizardLM-70B
  • CodeLLaMA-7B
  • InternLM-7B
  • AgentLM-7B
  • ChatGLM3-6B

Metrics

  • Per-ability scores (INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW)
  • Accuracy
  • End-to-end win rate (ToolBench-style comparison)

Datasets

  • T-Eval (23,305 test cases)

Benchmarks

  • T-Eval

Context Entities

Models

  • GPT-3.5 (gpt-3.5-turbo-16k)
  • GPT-4 (gpt-4-1106-preview)

Metrics

  • Win rate judged by GPT-4 (used for cross-check)

Datasets

  • ToolBench
  • ToolQA
  • API-Bank

Benchmarks

  • ToolBench (win rate comparison used)