ETAPP: an 800-case sandbox benchmark and key-point LLM evaluator for personalized tool use

Overview

Decision Snapshot: Needs Validation

The benchmark and key-point evaluator are practical and reproducible. Experiments support the claims, but evaluation is limited to a controlled sandbox and a selected set of models.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ETAPP and the key-point evaluator reveal where assistants fail to tailor actions or anticipate needs. Use them to catch personalization and proactivity gaps before product launch.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Founder

Summary TLDR

This paper introduces ETAPP, a controlled sandbox benchmark (800 cases, 16 user profiles, 33 APIs) that measures two qualities of personal assistants: Personalization (tailoring to user preferences) and Proactivity (anticipating unstated needs). It also proposes a key-point-based LLM evaluation method that hands annotated task key points to the evaluating LLM to reduce bias and improve agreement with humans (mean difference 0.01; 89.6% of scores within 1 point of the human score). Experiments show that strong models still struggle with proactive behavior and tool retrieval. Fine-tuning on a small set of annotated tool-invocation traces helps in-domain but gives limited out-of-domain gains. Code and sandbox artifacts are available on GitHub.

Problem Statement

Existing benchmarks focus on text personalization or tool use separately. There is no standard, controlled test that measures how well an LLM both uses tools and adapts that use to a specific user's preferences while being proactive. Automatic LLM-based graders are also noisy for these dimensions.

Main Contribution

ETAPP benchmark and sandbox: 800 test cases built from 16 user profiles, with 33 functional APIs across 9 categories, for evaluating personalization and proactivity.

Key-point-based LLM evaluation: provide human-written key points per case to the evaluator LLM to reduce scoring bias and increase alignment with human judgments.
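As a rough illustration of the key-point idea, the sketch below assembles a grading prompt that embeds the human-written key points for one case. The template wording, the 0 to 5 scale, the example query, and the function name `build_keypoint_prompt` are all assumptions made for illustration; they are not the paper's actual prompt.

```python
from typing import Sequence


def build_keypoint_prompt(
    query: str,
    response: str,
    key_points: Sequence[str],
    dimension: str = "Proactivity",
    max_score: int = 5,  # assumed scale; check the benchmark's own rubric
) -> str:
    """Assemble a grading prompt that lists the annotated key points for one case."""
    if not key_points:
        raise ValueError("key-point evaluation needs at least one key point")
    # Number the key points so the grader can refer to them individually.
    numbered = "\n".join(f"{i}. {kp}" for i, kp in enumerate(key_points, 1))
    return (
        f"You are grading an assistant's {dimension}.\n"
        f"User query:\n{query}\n\n"
        f"Assistant response and tool calls:\n{response}\n\n"
        f"Key points a good answer should cover:\n{numbered}\n\n"
        f"Score from 0 to {max_score} based on how well the key points are satisfied. "
        "Reply with the integer score only."
    )


# Hypothetical case, not taken from the ETAPP data.
prompt = build_keypoint_prompt(
    query="Book me a table for Friday dinner.",
    response="Called a restaurant search tool and booked a 7pm table.",
    key_points=[
        "Respects the user's vegetarian preference",
        "Checks the calendar for conflicts before booking",
    ],
)
print(prompt)
```

Keeping the key points as a numbered list makes it easy to ask the grader for per-point verdicts later, if a finer-grained score is wanted.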

Key Findings

ETAPP dataset and sandbox provide a repeatable test for personalized tool invocation.

Numbers: 800 test cases; 16 user profiles; 33 APIs; 9 categories (Table 5)

Practical Use: Run ETAPP to stress-test assistants for both personalization and proactivity in a stable, repeatable environment.

Evidence Ref: Section 2; Table 5

Key-point-based LLM evaluation substantially improves alignment with human scoring.

Numbers: Mean difference 0.01; limits-of-agreement (LoA) width 1.89 vs 2.24 without key points; 89.6% of samples within a 1-point difference

Practical Use: Provide per-case annotated key points to any automated evaluator to get scores closer to human judgment.

Evidence Ref: Section 4.4; Figures 5–6
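The agreement statistics above come from a Bland–Altman style comparison of evaluator and human scores. A minimal sketch of that kind of analysis follows. The scores are made-up toy values, the function name `bland_altman` is hypothetical, and the LoA width is taken here as upper limit minus lower limit, which is an assumption since this summary does not say how the paper defines it.

```python
import statistics


def bland_altman(llm_scores, human_scores):
    """Return mean difference, 95% limits of agreement, and share of pairs within 1 point."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need two equally sized lists with at least 2 scores")
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    lower = mean_diff - 1.96 * sd
    upper = mean_diff + 1.96 * sd
    within_one = sum(abs(d) <= 1 for d in diffs) / len(diffs)
    return {
        "mean_diff": mean_diff,
        "loa_lower": lower,
        "loa_upper": upper,
        "loa_width": upper - lower,  # assumed definition of "width"
        "within_one": within_one,
    }


# Toy scores for illustration only; not data from the paper.
llm = [3, 4, 2, 5, 3, 4, 1, 5]
human = [3, 3, 2, 4, 4, 4, 3, 5]
result = bland_altman(llm, human)
print(result)
```

A mean difference near zero indicates little systematic bias, while a narrower LoA width indicates that individual evaluator scores stay closer to the human ones.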

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Tool-Retrieval degrades tool use compared to Tool-Given	Example: GPT-4o PRC 3.95 → 2.67 (drop 1.28 points)	Tool-Given	−1.28 PRC points	ETAPP; Table 1	Table 1 shows consistent drops from Tool-Given to Tool-Retrieval	Table 1
Key-point evaluator vs human agreement (Proactivity)	Mean diff 0.01; 89.6% samples within ±1 point	Free evaluation (no key points)	LoA width reduced 1.89 vs 2.24; +39.6 pp ≤1-pt agreement	ETAPP; 96 samples analyzed	Section 4.4; Bland–Altman analysis	Section 4.4; Figures 5–6

What To Try In 7 Days

Run a subset of ETAPP queries against your assistant to find tool-retrieval and proactivity gaps.

Add per-case key points to your automated evaluation pipeline to reduce scorer drift versus human raters.

Prototype an E-ReAct prompt that forces explicit planning (key personalization/proactivity points) before tool calls.
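One way to prototype the explicit-planning step is a prompt template plus a small check that the plan actually appears before the first tool call. The sketch below is an assumption-heavy illustration: the template wording, the `Key points:` and `Action:` markers, and both function names are hypothetical and are not the paper's E-ReAct prompt.

```python
# Hypothetical template; the section title and wording are illustrative choices.
E_REACT_TEMPLATE = """You are a personal assistant with access to tools.
Before any tool call, write a section titled 'Key points:' that lists
(a) the user preferences relevant to this request and
(b) needs the user did not state but would likely want handled.
Then continue with Thought / Action / Observation steps.

User profile:
{profile}

Request:
{request}
"""


def build_e_react_prompt(profile: str, request: str) -> str:
    """Fill the planning-first template with one user's profile and request."""
    return E_REACT_TEMPLATE.format(profile=profile, request=request)


def plan_precedes_action(model_output: str) -> bool:
    """True if a 'Key points:' section appears before the first 'Action:' marker."""
    plan_at = model_output.find("Key points:")
    action_at = model_output.find("Action:")
    if plan_at == -1:
        return False  # no explicit plan at all
    return action_at == -1 or plan_at < action_at


good = "Key points:\n- prefers morning flights\nThought: search flights\nAction: search"
bad = "Action: search\nKey points:\n- prefers morning flights"
print(plan_precedes_action(good), plan_precedes_action(bad))
```

Running the check over a batch of transcripts gives a quick count of how often the model skips the planning step.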

Agent Features

Memory

long-term user profile

tool-utilizing preferences (category-level)

short-term user state and 9-day interaction history

Planning

explicit key-point planning (E-ReAct)

one-shot example prompting for ReAct

Tool Use

Function Calling (FC)

ReAct

E-ReAct

tool discovery via search_tools/get_tool_doc
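To make the tool-discovery setting concrete, here is a toy sketch of a registry exposing `search_tools` and `get_tool_doc`. Only those two names come from the text above. Their signatures, the keyword-overlap ranking, and the three example tools are assumptions for illustration and do not reflect the real sandbox.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    category: str
    doc: str


# Hypothetical tools; the real sandbox defines its own 33 APIs.
REGISTRY = {
    t.name: t
    for t in [
        ToolSpec("add_calendar_event", "calendar", "Create a calendar event with title and time."),
        ToolSpec("search_restaurants", "food", "Find restaurants by cuisine and location."),
        ToolSpec("send_email", "email", "Send an email to a recipient."),
    ]
}


def search_tools(keywords: str, top_k: int = 2) -> list[str]:
    """Rank tools by how many query words appear in their name, category, or doc."""
    words = set(keywords.lower().split())
    scored = []
    for tool in REGISTRY.values():
        text = f"{tool.name.replace('_', ' ')} {tool.category} {tool.doc}".lower()
        overlap = len(words & set(text.replace(".", " ").split()))
        if overlap:
            scored.append((overlap, tool.name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def get_tool_doc(name: str) -> str:
    """Return the documentation string for a tool found via search_tools."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return REGISTRY[name].doc


hits = search_tools("find restaurants nearby")
print(hits, [get_tool_doc(h) for h in hits])
```

In the Tool-Retrieval setting the agent has to issue such lookups itself, which adds a step where it can fail before any tool is called.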

Frameworks

ETAPP sandbox with 33 APIs

key-point-based LLM evaluation

Is Agentic

Yes

Architectures

personal tool-augmented LLM agent

sandboxed API executor

Collaboration

LLM as controller invoking external APIs

Optimization Features

Token Efficiency

Supplying only the needed preferences reduces token usage compared with supplying all preferences (2393 vs 3444 tokens)
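The two token counts imply a saving of 1051 tokens, or roughly 30.5% of the larger input. The sketch below reproduces that arithmetic and shows one possible way to keep only the preferences relevant to a request. The category-keyed dictionary layout and both function names are assumptions, not the benchmark's data format.

```python
def relative_reduction(baseline_tokens: int, reduced_tokens: int) -> float:
    """Fraction of tokens saved when moving from the baseline to the reduced input."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - reduced_tokens) / baseline_tokens


def select_needed_preferences(preferences: dict[str, str], needed: set[str]) -> dict[str, str]:
    """Keep only preference entries whose category is needed for the current request."""
    return {cat: text for cat, text in preferences.items() if cat in needed}


saving = relative_reduction(3444, 2393)
print(f"{saving:.1%} fewer tokens")

# Hypothetical category-level preferences for one user.
prefs = {
    "food": "Vegetarian; prefers quiet restaurants.",
    "travel": "Aisle seat; morning departures.",
    "shopping": "Prefers eco-friendly brands.",
}
print(select_needed_preferences(prefs, {"food"}))
```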

System Optimization

stable sandbox to isolate external variability

Training Optimization

LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hypasd-art/ETAPP

Data URLs

https://github.com/hypasd-art/ETAPP

Risks & Boundaries

Limitations

Does not cover multimodal tasks; focuses on API/text interactions only.

Preference modeling is coarse: selection preference among similar tools not modeled.

When Not To Use

Not suitable as a sole test for multimodal agents or vision/voice workflows.

Not a final safety clearance: it checks personalization and proactivity, not full safety or privacy risks.

Failure Modes

Models answer directly without justifying tool choice or planning (weak proactivity).

Tool retrieval or planning fails in longer, multi-step tasks.

Core Entities

Models

gpt-4o

DeepSeek-V3

Qwen2.5-72B-Instruct

Llama-3.1-70B-Instruct

watt-tool-70B

o1-preview

o1-mini

DeepSeek-R1

DeepSeek-R1-Distill-Qwen-32B

QwQ-32B-Preview

Qwen2.5-7B-Instruct (fine-tuned)

Metrics

Procedure (PRC)

Personalization (PSN)

Proactivity (PTV)

Datasets

ETAPP (this paper)

Benchmarks

ETAPP

