T-Eval: a stepwise benchmark that breaks LLM tool use into six measurable abilities

Overview

Decision Snapshot: Ready For Pilot

The benchmark offers practical, actionable diagnostics and a human-verified dataset; it is ready for model analysis but not a turnkey safety or deployment test.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 45%

Authors

Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, Feng Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

T-Eval gives runnable, per-skill diagnostics for building LLM-based tool agents so teams can pinpoint whether problems come from planning, choosing tools, formatting requests, or checking results.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead, Data Scientist

Summary TLDR

T-Eval is a new benchmark that evaluates how well LLMs act as tool agents by splitting tool use into six concrete abilities: PLAN, REASON, RETRIEVE, UNDERSTAND, INSTRUCT, and REVIEW. The authors build a human-verified dataset (23,305 test cases from ~553 annotated queries), provide per-ability metrics (string and strict JSON formats), and show fine-grained gaps missed by holistic tests. Results highlight that GPT-4 leads overall (86.4), open-source models improve with scale (e.g., Qwen-72B 71.4), and common weaknesses are format following, retrieval, and review. The code and benchmark are available at the project repo.

Problem Statement

Current tool-use benchmarks judge only final outputs or single API calls. That hides which internal skills (planning, selecting tools, building parameters, checking results) fail in multi-step, real-world tool usage. We need a stable, fine-grained test that isolates each sub-skill and reduces variance from live APIs.

Main Contribution

A step-by-step benchmark (T-Eval) that decomposes tool utilization into six measurable abilities: INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW.

A multi-agent data generation pipeline plus human verification to produce golden solution paths and tool responses, yielding 23,305 test cases across the six abilities.

Key Findings

Top commercial models lead overall tool-use performance.

Numbers: GPT-4 overall 86.4; GPT-3.5 84.0; Claude2 78.8

Practical Use: For production tool agents, start with API models (GPT-4/3.5) when possible; they need less specialized tuning to get reliable multi-step tool behavior.

Evidence Ref: Table 1 (main results)

Open-source models improve with scale but still trail best API models.

Numbers: Qwen-7B 59.5 → Qwen-72B 71.4 overall

Practical Use: Scaling open models helps, but expect ~10–20 point gaps vs top APIs; plan additional tuning or validation before deploying open models as agents.

Evidence Ref: Fig. 3 and Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall score (selected models)	GPT-4 86.4; GPT-3.5 84.0; Claude2 78.8; Qwen-72B 71.4	—	—	T-Eval overall average	Table 1 overall column	Table 1
Review ability (classification of tool responses)	GPT-4 94.5; GPT-3.5 75.6; many open models 50–63	Review best=GPT-4	GPT-4 − open models ≈ 30–45 pts	REVIEW subset (choice)	Table 1 REVIEW column	Table 1
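The gaps between models follow from plain subtraction on the overall scores. A minimal sketch in Python, using only the figures quoted in this summary; the helper name `gap` is illustrative and not part of the T-Eval codebase:

```python
# Overall T-Eval scores as quoted in this summary (Table 1).
overall = {
    "GPT-4": 86.4,
    "GPT-3.5": 84.0,
    "Claude2": 78.8,
    "Qwen-72B": 71.4,
    "Qwen-7B": 59.5,
}


def gap(a: str, b: str) -> float:
    """Score of model `a` minus score of model `b`, rounded to one decimal."""
    return round(overall[a] - overall[b], 1)


# Distance of every other model from the best overall score.
gaps_to_gpt4 = {name: gap("GPT-4", name) for name in overall if name != "GPT-4"}

# Gain from scaling Qwen from 7B to 72B parameters.
scale_gain = gap("Qwen-72B", "Qwen-7B")

print(gaps_to_gpt4)
print(scale_gain)
```

On these figures, Qwen-72B trails GPT-4 by 15.0 points and gains 11.9 points over Qwen-7B.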

What To Try In 7 Days

Run T-Eval subsets on your model to see if failures are format-related (INSTRUCT) or functional (RETRIEVE/REVIEW).

If JSON outputs fail, add a format-repair layer or fine-tune on format-specific examples.

Prioritize training data and retrieval supervision before scaling model size to reduce retrieval and review gaps.
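A format-repair layer of the kind suggested above could look like the following. This is a minimal sketch, not something T-Eval ships: the function name `repair_json` and the specific repair steps (strip a code fence, trim to the outermost braces, drop trailing commas) are assumptions chosen for illustration.

```python
import json
import re
from typing import Any, Optional

# Markdown code-fence marker, built indirectly to keep this example readable.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def repair_json(raw: str) -> Optional[Any]:
    """Try to recover a JSON value from a noisy model reply.

    Returns the parsed value, or None if no repair step succeeds.
    """
    # 1. The reply may already be valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # 2. If the reply wraps its payload in a code fence, look inside it.
    fenced = FENCE_RE.search(raw)
    candidate = fenced.group(1) if fenced else raw

    # 3. Keep only the span from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = candidate[start : end + 1]

    # 4. Drop trailing commas before a closing brace or bracket.
    #    Caveat: this can also touch such a sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Scoring the same replies with and without such a layer gives a rough split between failures that are purely about format and failures in the underlying tool call.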

Agent Features

Planning

PLAN measured via action-sequence similarity and ordering

Plans evaluated as ordered action lists
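To make "action-sequence similarity and ordering" concrete, here is a minimal sketch that scores a predicted plan against a gold plan with a longest-common-subsequence F1. This is a stand-in for illustration only: T-Eval's own PLAN metric may be defined differently, and exact string matching on action names as well as the tool names in the usage example are assumptions.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two action lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def plan_f1(predicted: List[str], gold: List[str]) -> float:
    """Order-aware F1: only actions that appear in the gold order count as matched."""
    if not predicted or not gold:
        return 0.0
    matched = lcs_length(predicted, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tool names, used only to exercise the function.
gold_plan = ["search_flights", "book_flight", "send_email"]
swapped_plan = ["search_flights", "send_email", "book_flight"]

print(plan_f1(gold_plan, gold_plan))
print(plan_f1(swapped_plan, gold_plan))
```

The swapped plan contains all the right actions but is penalized because only two of the three can be aligned in gold order, which is the behavior an ordering-sensitive metric is meant to capture.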

Tool Use

INSTRUCT (formatting tool calls)

RETRIEVE (choose tool)

UNDERSTAND (fill parameters)

REVIEW (judge response)
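The split between choosing a tool and filling its parameters can be sketched as two separate checks on one predicted call. This is an illustrative simplification under stated assumptions: the `ToolCall` type, the exact-match scoring rules, and the example tool name are hypothetical and are not taken from the T-Eval code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToolCall:
    """A single tool invocation: which tool, and with which arguments."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def retrieve_score(pred: ToolCall, gold: ToolCall) -> float:
    """RETRIEVE-style check: 1.0 if the right tool was chosen, else 0.0."""
    return 1.0 if pred.name == gold.name else 0.0


def understand_score(pred: ToolCall, gold: ToolCall) -> float:
    """UNDERSTAND-style check: fraction of gold arguments reproduced exactly.

    Scored as 0.0 when the wrong tool was chosen, since its arguments
    cannot be compared meaningfully.
    """
    if pred.name != gold.name:
        return 0.0
    if not gold.args:
        return 1.0 if not pred.args else 0.0
    hits = sum(1 for k, v in gold.args.items() if k in pred.args and pred.args[k] == v)
    return hits / len(gold.args)


# Hypothetical example: right tool, one of two arguments wrong.
gold_call = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
pred_call = ToolCall("get_weather", {"city": "Paris", "unit": "kelvin"})

print(retrieve_score(pred_call, gold_call))
print(understand_score(pred_call, gold_call))
```

Keeping the two scores separate is what lets a diagnostic say "the model picks the right tool but fills it in wrongly" instead of reporting one blended failure.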

Frameworks

ReAct (used for end-to-end agent evaluation)

Architectures

multi-agent annotation pipeline (planner/executor/reviewer)

Collaboration

multi-agent pipeline for data annotation (separate planner/executor/reviewer roles)

Optimization Features

Training Optimization

human-in-loop refinement for instruction generation

multi-agent annotation to reduce annotation errors

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/open-compass/T-Eval

Data URLs

https://github.com/open-compass/T-Eval

Risks & Boundaries

Limitations

Constructed tool documentation is synthetic and fixed, so results do not capture failures caused by live API instability or temporal changes.

Scores depend on the protocol: the string format and the strict JSON format can under- or over-estimate ability, depending on whether format-following is a priority.

When Not To Use

When you need evaluation against live, changing external APIs or real-time web state.

When you require safety or adversarial testing beyond format and selection (e.g., prompt injection stress tests).

Failure Modes

Format parsing failures (models produce unparsable JSON) that artificially lower strict scores.

Wrong tool choice (RETRIEVE errors) even when plan and reasoning are correct.
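The effect of parsing failures on strict scores can be shown on a toy batch. The sketch below compares a strict check (the reply must parse as JSON with a `name` field) against a lenient check (the gold tool name merely appears in the text). Both rules, the function names, and the sample replies are assumptions for illustration and are not T-Eval's actual string and JSON protocols.

```python
import json
from typing import List, Optional, Tuple


def strict_tool_name(raw: str) -> Optional[str]:
    """Strict rule: the reply must be a JSON object with a string 'name' key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and isinstance(obj.get("name"), str):
        return obj["name"]
    return None


def lenient_hit(raw: str, gold_name: str) -> bool:
    """Lenient rule: the gold tool name only has to appear somewhere in the reply."""
    return gold_name in raw


def score(samples: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (strict accuracy, lenient accuracy) over (raw reply, gold tool) pairs."""
    if not samples:
        raise ValueError("samples must not be empty")
    strict = sum(strict_tool_name(raw) == gold for raw, gold in samples)
    lenient = sum(lenient_hit(raw, gold) for raw, gold in samples)
    return strict / len(samples), lenient / len(samples)


# Hypothetical replies: valid JSON, free text, wrong tool, trailing comma.
samples = [
    ('{"name": "search_web", "args": {"q": "T-Eval"}}', "search_web"),
    ("I will call search_web with q=T-Eval", "search_web"),
    ('{"name": "get_weather", "args": {}}', "search_web"),
    ('{"name": "search_web", "args": {"q": "x",}}', "search_web"),
]

strict_acc, lenient_acc = score(samples)
print(strict_acc, lenient_acc)
```

In this batch only the third reply is a genuine wrong-tool error; the second and fourth fail the strict check purely on format, which is the gap between the two accuracies.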

Core Entities

Models

GPT-4, gpt-3.5-turbo, Claude2, Qwen-72B, Qwen-14B, Qwen-7B, LLaMA2-7B, LLaMA2-13B, LLaMA2-70B, Baichuan2-7B, Baichuan2-13B, Mistral-7B, Vicuna-7B, Vicuna-13B, WizardLM-13B, WizardLM-70B, CodeLLaMA-7B, InternLM-7B, AgentLM-7B, ChatGLM3-6B

Metrics

Per-ability scores (INSTRUCT, PLAN, REASON, RETRIEVE, UNDERSTAND, REVIEW)

Accuracy

End-to-end win rate (ToolBench-style comparison)

Datasets

T-Eval (23,305 test cases)

Benchmarks

T-Eval

Context Entities

Models

GPT-3.5 (gpt-3.5-turbo-16k), GPT-4 (gpt-4-1106-preview)

Metrics

Win rate judged by GPT-4 (used for cross-check)

Datasets

ToolBench, ToolQA, API-Bank

Benchmarks

ToolBench (win rate comparison used)
