Overview
The benchmark and key-point evaluator are practical and reproducible; experiments support claims but evaluation is limited to a controlled sandbox and selected models.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
ETAPP and the key-point evaluator reveal where assistants fail to tailor actions or anticipate needs. Use them to catch personalization and proactivity gaps before product launch.
Summary TLDR
This paper introduces ETAPP, a controlled sandbox benchmark (800 cases, 16 user profiles, 33 APIs) that measures two qualities of personal assistants: Personalization (tailoring actions to user preferences) and Proactivity (anticipating unstated needs). It also proposes a key-point-based LLM evaluation method that gives the evaluating LLM the annotated key points for each task, which reduces bias and improves agreement with human raters (mean difference 0.01; 89.6% of scores within 1 point of the human score). Experiments show that strong models still struggle with proactive behavior and with tool retrieval. Fine-tuning on a small set of annotated tool-invocation traces helps in-domain but yields limited out-of-domain gains. Code and sandbox artifacts are available (see Reproducibility).
Problem Statement
Existing benchmarks focus on text personalization or tool use separately. There is no standard, controlled test that measures how well an LLM both uses tools and adapts that use to a specific user's preferences while being proactive. Automatic LLM-based graders are also noisy for these dimensions.
Main Contribution
ETAPP benchmark and sandbox: 800 test cases from 16 user profiles, 33 functional APIs across 9 categories to evaluate personalization and proactivity.
Key-point-based LLM evaluation: provide human-written key points per case to the evaluator LLM to reduce scoring bias and increase alignment with human judgments.
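A minimal sketch of how per-case key points might be handed to an evaluator LLM is shown below. The `EvalCase` schema, the function name, and the prompt wording are all illustrative assumptions, not the paper's actual prompt or data format; the scoring scale is deliberately left to your own rubric.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One benchmark case plus its human-written key points (hypothetical schema)."""
    query: str
    response: str
    personalization_points: list[str] = field(default_factory=list)
    proactivity_points: list[str] = field(default_factory=list)


def build_keypoint_prompt(case: EvalCase, dimension: str) -> str:
    """Assemble a grading prompt that lists the key points for one dimension.

    `dimension` must be "personalization" or "proactivity".
    """
    points = {
        "personalization": case.personalization_points,
        "proactivity": case.proactivity_points,
    }[dimension]
    # Render key points as bullets; make an empty annotation explicit.
    bullets = "\n".join(f"- {p}" for p in points) or "- (none annotated)"
    return (
        f"You are grading an assistant's {dimension}.\n"
        f"User query:\n{case.query}\n\n"
        f"Assistant response:\n{case.response}\n\n"
        f"Key points a strong response should cover:\n{bullets}\n\n"
        "Check each key point, then give one integer score on the rubric's scale."
    )
```

The resulting string would be sent to whichever evaluator model you use. Keeping the key points in a separate field makes it easy to run the same case with and without them and compare both against human scores.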
Key Findings
ETAPP dataset and sandbox provide a repeatable test for personalized tool invocation.
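Key-point-based LLM evaluation improves alignment with human scoring: on 96 analyzed samples, the share of Proactivity scores within 1 point of the human score rises by 39.6 percentage points over free evaluation.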
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Tool-Retrieval degrades tool use compared to Tool-Given | Example: GPT-4o PRC 3.95 → 2.67 (drop 1.28 points) | Tool-Given | −1.28 PRC points | ETAPP; Table 1 | Table 1 shows consistent drops from Tool-Given to Tool-Retrieval | Table 1 |
| Key-point evaluator vs human agreement (Proactivity) | Mean diff 0.01; 89.6% of samples within ±1 point | Free evaluation (no key points) | LoA width 1.89 (key-point) vs 2.24 (free); +39.6 pp of samples within ±1 point | ETAPP; 96 samples analyzed | Section 4.4; Bland–Altman analysis | Section 4.4; Figures 5–6 |
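The agreement figures in the second row (mean difference, limits of agreement, share within ±1 point) can be reproduced on your own paired scores with a short Bland–Altman calculation. The sketch below uses the conventional 1.96 × standard deviation limits; how the source defines "LoA width" (full span or half-width) is not stated here, so treat that choice, along with the function and key names, as assumptions.

```python
from statistics import mean, stdev


def bland_altman(llm_scores, human_scores):
    """Return bias, 95% limits of agreement, and share of pairs within 1 point."""
    if len(llm_scores) != len(human_scores):
        raise ValueError("score lists must be paired and equal in length")
    # Per-sample difference: evaluator score minus human score.
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    bias = mean(diffs)
    # Conventional 95% limits of agreement: bias +/- 1.96 * sample std dev.
    half_width = 1.96 * stdev(diffs)
    within_one = sum(abs(d) <= 1 for d in diffs) / len(diffs)
    return {
        "mean_diff": bias,
        "loa_low": bias - half_width,
        "loa_high": bias + half_width,
        "within_one_point": within_one,
    }
```

Running this once for the key-point evaluator and once for the free evaluator on the same human-scored samples gives a like-for-like comparison of the kind reported in the table.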
What To Try In 7 Days
Run a subset of ETAPP queries against your assistant to find tool-retrieval and proactivity gaps.
Add per-case key points to your automated evaluation pipeline to reduce scorer drift versus human raters.
Prototype an E-ReAct prompt that forces explicit planning (key personalization/proactivity points) before tool calls.
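For the third item, one way to prototype a plan-first prompt is sketched below. The template wording, the `Plan:` / `Action:` line markers, and both function names are hypothetical choices made for illustration; they are not the paper's E-ReAct prompt.

```python
PLAN_FIRST_TEMPLATE = """\
User profile:
{profile}

User request:
{query}

Before calling any tool, write a section starting with "Plan:" that lists
1. the personalization points (preferences in the profile that apply), and
2. the proactivity points (needs the user did not state but likely has).
Only after the plan, emit tool calls, one per line, as "Action: <tool>(<args>)".
"""


def build_plan_first_prompt(profile: str, query: str) -> str:
    """Fill the template with one user's profile and request."""
    return PLAN_FIRST_TEMPLATE.format(profile=profile, query=query)


def plan_precedes_actions(model_output: str) -> bool:
    """True if a "Plan:" line appears before the first "Action:" line."""
    lines = [ln.strip() for ln in model_output.splitlines()]
    plan_idx = next((i for i, ln in enumerate(lines) if ln.startswith("Plan:")), None)
    action_idx = next((i for i, ln in enumerate(lines) if ln.startswith("Action:")), None)
    if plan_idx is None:
        return False  # no explicit plan at all
    return action_idx is None or plan_idx < action_idx
```

The checker only verifies ordering, not plan quality, so it works as a cheap gate in front of a fuller evaluation such as key-point scoring.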
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Risks & Boundaries
Limitations
Does not cover multimodal tasks; focuses on API/text interactions only.
Preference modeling is coarse: a user's preference for one tool over similar alternatives is not modeled.
When Not To Use
Not suitable as a sole test for multimodal agents or vision/voice workflows.
Not a final safety clearance — it checks personalization/proactivity, not full safety or privacy risks.
Failure Modes
Models answer directly without justifying tool choice or planning (weak proactivity).
Tool retrieval or planning fails in longer, multi-step tasks.

