ETAPP: an 800-case sandbox benchmark and key-point LLM evaluator for personalized tool use

March 2, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark and key-point evaluator are practical and reproducible. The experiments support the paper's claims, but evaluation is limited to a controlled sandbox and a selected set of models.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ETAPP and the key-point evaluator reveal where assistants fail to tailor actions or anticipate needs. Use them to catch personalization and proactivity gaps before product launch.

Summary TLDR

This paper introduces ETAPP, a controlled sandbox benchmark (800 cases, 16 user profiles, 33 APIs) that measures two qualities of personal assistants: Personalization (tailoring to user preferences) and Proactivity (anticipating unstated needs). It also proposes a key-point-based LLM evaluation method that hands annotated task key points to the evaluating LLM to reduce bias and improve agreement with humans (mean difference 0.01; 89.6% of scores within 1 point of the human score). Experiments show that strong models still struggle with proactive behavior and tool retrieval. Fine-tuning on a small set of annotated tool-invocation traces helps in-domain but gives limited out-of-domain gains. Code and sandbox artifacts are available.

Problem Statement

Existing benchmarks focus on text personalization or tool use separately. There is no standard, controlled test that measures how well an LLM both uses tools and adapts that use to a specific user's preferences while being proactive. Automatic LLM-based graders are also noisy for these dimensions.

Main Contribution

ETAPP benchmark and sandbox: 800 test cases built from 16 user profiles, with 33 functional APIs across 9 categories, for evaluating personalization and proactivity.

Key-point-based LLM evaluation: provide human-written key points per case to the evaluator LLM to reduce scoring bias and increase alignment with human judgments.
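The key-point idea can be illustrated with a minimal sketch of how a grading prompt might be assembled. This is not the paper's actual prompt or data schema: the `EvalCase` fields, the `build_keypoint_prompt` helper, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalCase:
    """One test case. Field names are illustrative, not ETAPP's real schema."""
    instruction: str
    personalization_points: List[str]
    proactivity_points: List[str]


def build_keypoint_prompt(case: EvalCase, transcript: str) -> str:
    """Build a grading prompt that hands the annotated key points to the evaluator LLM."""

    def bullets(points: List[str]) -> str:
        # Render key points as a bullet list, with a placeholder when none exist.
        return "\n".join(f"- {p}" for p in points) or "- (none annotated)"

    return (
        "You are grading a tool-using assistant.\n\n"
        f"User instruction:\n{case.instruction}\n\n"
        f"Assistant transcript (tool calls and final answer):\n{transcript}\n\n"
        f"Personalization key points:\n{bullets(case.personalization_points)}\n\n"
        f"Proactivity key points:\n{bullets(case.proactivity_points)}\n\n"
        "Check each key point against the transcript, then score Procedure, "
        "Personalization and Proactivity."
    )
```

The design choice worth noting is that the evaluator is asked to check concrete, human-written points rather than to judge personalization or proactivity from scratch, which is the mechanism the paper credits for reduced scoring bias.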

Key Findings

ETAPP dataset and sandbox provide a repeatable test for personalized tool invocation.

Numbers: 800 test cases; 16 user profiles; 33 APIs; 9 categories (Table 5)

Practical Use: Run ETAPP to stress-test assistants for both personalization and proactivity in a stable, repeatable environment.

Evidence Ref: Section 2; Table 5

Key-point-based LLM evaluation substantially improves alignment with human scoring.

Numbers: Mean difference 0.01; limits-of-agreement (LoA) width 1.89 vs 2.24; 89.6% of samples within a 1-point difference

Practical Use: Provide per-case annotated key points to any automated evaluator to get scores closer to human judgment.

Evidence Ref: Section 4.4; Figures 5–6
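The agreement figures above come from a Bland–Altman style comparison of evaluator and human scores. The sketch below shows how such statistics are commonly computed; it assumes the conventional 95% limits of agreement (mean difference ± 1.96 standard deviations), which may not match the paper's exact definition of "LoA width", and the function name and sample scores are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def agreement_stats(llm: Sequence[float], human: Sequence[float]) -> Dict[str, float]:
    """Bland-Altman style agreement between evaluator scores and human scores."""
    if len(llm) != len(human) or len(llm) < 2:
        raise ValueError("need two equally long score lists with at least 2 items")

    diffs = [a - b for a, b in zip(llm, human)]
    bias = mean(diffs)   # mean difference (systematic over- or under-scoring)
    sd = stdev(diffs)    # sample standard deviation of the differences

    # Conventional 95% limits of agreement (an assumption, see lead-in).
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd

    # Share of samples where evaluator and human differ by at most 1 point.
    within_one = sum(abs(d) <= 1 for d in diffs) / len(diffs)

    return {
        "mean_diff": bias,
        "loa_lower": lower,
        "loa_upper": upper,
        "within_one": within_one,
    }


if __name__ == "__main__":
    # Made-up scores, purely to show the call shape.
    print(agreement_stats([3, 4, 2, 5], [3, 3, 3, 5]))
```

Running the same function on scores from a free-form evaluator and from a key-point evaluator gives a like-for-like comparison of bias and spread against human raters.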

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Tool-Retrieval degrades tool use compared to Tool-Given | Example: GPT-4o PRC 3.95 → 2.67 (drop of 1.28 points) | Tool-Given | −1.28 PRC points | ETAPP; Table 1 | Table 1 shows consistent drops from Tool-Given to Tool-Retrieval | Table 1 |
| Key-point evaluator vs human agreement (Proactivity) | Mean diff 0.01; 89.6% of samples within ±1 point | Free evaluation (no key points) | LoA width reduced (1.89 vs 2.24); +39.6 pp ≤1-pt agreement | ETAPP; 96 samples analyzed | Section 4.4; Bland–Altman analysis | Section 4.4; Figures 5–6 |
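The deltas in the results above can be re-derived from the reported values with a few lines of arithmetic. The 50.0% figure for free evaluation is not stated in this summary; it is only what the reported +39.6 pp gain implies, so treat it as an inference.

```python
# Values as reported in the results above.
tool_given_prc = 3.95      # GPT-4o Procedure score, Tool-Given setting
tool_retrieval_prc = 2.67  # GPT-4o Procedure score, Tool-Retrieval setting
prc_delta = round(tool_retrieval_prc - tool_given_prc, 2)

keypoint_within_one = 89.6  # % of samples within 1 point, key-point evaluation
gain_pp = 39.6              # reported gain in percentage points
# Inferred, not reported: share within 1 point for free evaluation.
implied_free_within_one = round(keypoint_within_one - gain_pp, 1)

loa_keypoint, loa_free = 1.89, 2.24
loa_reduction = round(loa_free - loa_keypoint, 2)

print(prc_delta, implied_free_within_one, loa_reduction)
```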

What To Try In 7 Days

Run a subset of ETAPP queries against your assistant to find tool-retrieval and proactivity gaps.

Add per-case key points to your automated evaluation pipeline to reduce scorer drift versus human raters.

Prototype an E-ReAct prompt that forces explicit planning (key personalization/proactivity points) before tool calls.
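For the E-ReAct prototype, a rough starting point is a prompt that asks for a key-point plan first, plus a cheap check that the model actually produced it before acting. The prompt wording, the `Key points:` and `Action:` markers, and both helper names are assumptions for illustration, not the paper's E-ReAct prompt.

```python
from typing import List


def build_e_react_prompt(instruction: str, profile_summary: str,
                         tool_names: List[str]) -> str:
    """Compose a prompt that requires a key-point plan before any tool call."""
    tools = ", ".join(tool_names)
    return (
        f"User profile:\n{profile_summary}\n\n"
        f"Available tools: {tools}\n\n"
        f"Instruction: {instruction}\n\n"
        "Before calling any tool, write a section that starts with 'Key points:' "
        "and lists (1) the user preferences that apply and (2) needs the user "
        "did not state but is likely to have.\n"
        "Only then continue with Thought / Action / Observation steps."
    )


def plan_precedes_action(model_output: str) -> bool:
    """Return True if a 'Key points:' section appears before the first 'Action:'."""
    plan_idx = model_output.find("Key points:")
    action_idx = model_output.find("Action:")
    if plan_idx == -1:
        return False  # no plan at all
    # A plan with no action yet is fine; otherwise the plan must come first.
    return action_idx == -1 or plan_idx < action_idx
```

A check like `plan_precedes_action` can gate tool execution in a prototype: if it returns False, the harness can re-prompt instead of running the tool call.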

Agent Features

Memory: long-term user profile; tool-utilizing preferences (category-level); short-term user state and 9-day interaction history
Planning: explicit key-point planning (E-ReAct); one-shot example prompting for ReAct
Tool Use: Function Calling (FC); ReAct; E-ReAct; tool discovery via search_tools/get_tool_doc
Frameworks: ETAPP sandbox with 33 APIs; key-point-based LLM evaluation
Is Agentic: Yes
Architectures: personal tool-augmented LLM agent; sandboxed API executor
Collaboration: LLM as controller invoking external APIs
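The tool-discovery feature listed above names two calls, search_tools and get_tool_doc. The toy registry below sketches how such a pair could behave in a Tool-Retrieval setting. Only the two function names come from the source; their signatures, the keyword-overlap ranking, and the three example tools are assumptions and not the ETAPP sandbox's implementation.

```python
from typing import Dict, List

# Hypothetical tool documentation; ETAPP's real sandbox has 33 APIs.
TOOL_DOCS: Dict[str, str] = {
    "get_weather": "Return the weather forecast for a city and date.",
    "add_calendar_event": "Create a calendar event with title, start and end time.",
    "search_restaurants": "Find restaurants by cuisine and location.",
}


def search_tools(query: str, top_k: int = 2) -> List[str]:
    """Rank tools by keyword overlap between the query and each tool's name and doc."""
    query_words = set(query.lower().split())
    scored = []
    for name, doc in TOOL_DOCS.items():
        doc_words = set(doc.lower().replace(".", " ").replace(",", " ").split())
        doc_words |= set(name.split("_"))
        overlap = len(query_words & doc_words)
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


def get_tool_doc(name: str) -> str:
    """Return the documentation string for a tool the agent has discovered."""
    if name not in TOOL_DOCS:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_DOCS[name]
```

In this setup the agent sees no tool list up front: it must issue a search, read the returned documentation, and only then call the tool, which is the extra step where the Tool-Retrieval scores drop relative to Tool-Given.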

Optimization Features

Token Efficiency: the "Needed" preferences input uses fewer tokens than the "All" preferences input (2393 vs 3444 tokens)
System Optimization: stable sandbox to isolate external variability
Training Optimization: LoRA
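The token-efficiency point can be sketched as filtering the user's category-level preferences down to the categories a task needs before building the prompt. How the paper decides which preferences are "needed" is not described in this summary, so the category names and the `select_needed_preferences` helper are hypothetical; only the two token counts are from the source.

```python
from typing import Dict, Iterable


def select_needed_preferences(preferences: Dict[str, str],
                              needed_categories: Iterable[str]) -> Dict[str, str]:
    """Keep only the preference entries whose category is needed for this task."""
    wanted = set(needed_categories)
    return {cat: text for cat, text in preferences.items() if cat in wanted}


def relative_saving(all_tokens: int, needed_tokens: int) -> float:
    """Fraction of tokens saved by the smaller input."""
    return (all_tokens - needed_tokens) / all_tokens


if __name__ == "__main__":
    prefs = {
        "shopping": "Prefers eco-friendly brands.",
        "calendar": "No meetings before 9am.",
        "music": "Likes jazz while working.",
    }
    print(select_needed_preferences(prefs, ["calendar"]))
    # Token counts reported in the source: 2393 (Needed) vs 3444 (All).
    print(f"{relative_saving(3444, 2393):.1%}")
```

With the reported counts, the saving works out to roughly 30.5% of input tokens.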

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Does not cover multimodal tasks; focuses on API/text interactions only.

Preference modeling is coarse: a user's preference for one tool over other similar tools is not modeled.

When Not To Use

Not suitable as a sole test for multimodal agents or vision/voice workflows.

Not a final safety clearance: it checks personalization and proactivity, not full safety or privacy risks.

Failure Modes

Models answer directly without justifying tool choice or planning (weak proactivity).

Tool retrieval or planning fails in longer, multi-step tasks.

Core Entities

Models

gpt-4o; DeepSeek-V3; Qwen2.5-72B-Instruct; Llama-3.1-70B-Instruct; watt-tool-70B; o1-preview; o1-mini; DeepSeek-R1; DeepSeek-R1-Distill-Qwen-32B; QwQ-32B-Preview; Qwen2.5-7B-Instruct (fine-tuned)

Metrics

Procedure (PRC); Personalization (PSN); Proactivity (PTV)

Datasets

ETAPP (this paper)

Benchmarks

ETAPP