Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
ETAPP and the key-point evaluator reveal where assistants fail to tailor actions or anticipate needs. Use them to catch personalization and proactivity gaps before product launch.
Summary TLDR
This paper introduces ETAPP, a controlled sandbox benchmark (800 cases, 16 user profiles, 33 APIs) that measures two qualities of personal assistants: Personalization (tailoring actions to user preferences) and Proactivity (anticipating unstated needs). It also proposes a key-point-based LLM evaluation method that gives annotated task key points to the evaluating LLM to reduce bias and improve agreement with human raters (mean difference of 0.01; 89.6% of scores within 1 point). Experiments show that even strong models struggle with proactive behavior and tool retrieval. Fine-tuning on a small set of annotated tool-invocation traces helps in-domain but gives limited out-of-domain gains. Code and sandbox artifacts are
Problem Statement
Existing benchmarks focus on text personalization or tool use separately. There is no standard, controlled test that measures how well an LLM both uses tools and adapts that use to a specific user's preferences while being proactive. Automatic LLM-based graders are also noisy for these dimensions.
Main Contribution
ETAPP benchmark and sandbox: 800 test cases built from 16 user profiles, with 33 functional APIs across 9 categories, for evaluating personalization and proactivity.
Key-point-based LLM evaluation: provide human-written key points per case to the evaluator LLM to reduce scoring bias and increase alignment with human judgments.
Analysis of tool-invoking methods and fine-tuning: compares Function Calling (FC), ReAct, and E-ReAct; shows that E-ReAct and reasoning traces improve personalization and proactivity; fine-tuning helps in-domain but yields limited out-of-domain (OOD) gains.
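To make the key-point-based evaluation idea concrete, here is a minimal sketch of how a grader prompt might be assembled from human-written key points. The prompt wording, the 0-to-`max_score` scale, and the names `EvalCase` and `build_grader_prompt` are assumptions for illustration; they are not the paper's actual template or schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One benchmark case plus its annotated key points (hypothetical schema)."""
    instruction: str
    response: str
    personalization_points: list[str] = field(default_factory=list)
    proactivity_points: list[str] = field(default_factory=list)


def build_grader_prompt(case: EvalCase, dimension: str, max_score: int = 5) -> str:
    """Build a grading prompt that lists the key points for one dimension."""
    points_by_dimension = {
        "personalization": case.personalization_points,
        "proactivity": case.proactivity_points,
    }
    if dimension not in points_by_dimension:
        raise ValueError(f"unknown dimension: {dimension}")
    points = points_by_dimension[dimension]
    # Number the key points so the grader can refer to them individually.
    bullet_list = "\n".join(f"{i}. {p}" for i, p in enumerate(points, start=1))
    return (
        f"You are grading an assistant's {dimension}.\n"
        f"User instruction:\n{case.instruction}\n\n"
        f"Assistant response and tool calls:\n{case.response}\n\n"
        f"Key points a good answer should cover:\n{bullet_list}\n\n"
        f"Return an integer score from 0 to {max_score} based on how many "
        f"key points are satisfied."
    )
```

Only the key points for the dimension being graded are shown to the evaluator, so a Proactivity grade is not influenced by Personalization annotations.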
Key Findings
ETAPP dataset and sandbox provide a repeatable test for personalized tool invocation.
Key-point-based LLM evaluation substantially improves alignment with human scoring.
Models score poorly on Proactivity, and their scores drop markedly when they must retrieve tools themselves (Tool-Retrieval) instead of being given them (Tool-Given).
Fine-tuning on annotated tool-invocation traces improves in-domain performance but helps less on new scenarios.
Results
Tool-Retrieval degrades tool use compared to Tool-Given
Key-point evaluator vs human agreement (Proactivity)
Overall Proactivity scores are low across models
Fine-tuning on reasoning traces helps in-domain
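The evaluator-versus-human agreement reported in the summary (a mean difference and the share of scores within 1 point) can be computed along the following lines. This is a sketch that assumes the mean difference is the signed average of evaluator score minus human score; the source does not say whether it is signed or absolute. The scores in the usage lines are made up and are not ETAPP results.

```python
from statistics import mean


def agreement_stats(llm_scores, human_scores, tolerance=1.0):
    """Return (mean signed difference, fraction of pairs within tolerance)."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    mean_diff = mean(diffs)
    within = sum(abs(d) <= tolerance for d in diffs) / len(diffs)
    return mean_diff, within


# Made-up scores purely for illustration.
llm = [3, 2, 4, 1, 5]
human = [3, 3, 4, 3, 4]
mean_diff, within = agreement_stats(llm, human)
print(f"mean diff = {mean_diff:.2f}, within 1 point = {within:.0%}")
```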
Who Should Care
What To Try In 7 Days
Run a subset of ETAPP queries against your assistant to find tool-retrieval and proactivity gaps.
Add per-case key points to your automated evaluation pipeline to reduce scorer drift versus human raters.
Prototype an E-ReAct prompt that forces explicit planning (key personalization/proactivity points) before tool calls.
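One way to start on the E-ReAct prototype is a prompt template that asks for an explicit plan before any tool call. The template text and the helper `build_e_react_prompt` below are illustrative assumptions, not the prompt used in the paper, and the tool names in any call are placeholders.

```python
E_REACT_TEMPLATE = """You are a personal assistant with access to tools.

User profile:
{profile}

Available tools:
{tools}

Before calling any tool, write a plan with two headed lists:
Personalization points: which user preferences affect this task.
Proactivity points: which unstated needs the user is likely to have.

Then alternate Thought / Action / Observation steps, and make sure each
planned point is addressed before giving the final answer.

User instruction: {instruction}
"""


def build_e_react_prompt(profile: str, tools: list[str], instruction: str) -> str:
    """Fill the template; tool names are rendered one per line."""
    tool_lines = "\n".join(f"- {name}" for name in tools)
    return E_REACT_TEMPLATE.format(
        profile=profile, tools=tool_lines, instruction=instruction
    )
```

Comparing outputs from this prompt against a plain ReAct prompt on the same queries gives a quick read on whether explicit planning changes tool choices.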
Agent Features
Memory
- long-term user profile
- tool-utilizing preferences (category-level)
- short-term user state and 9-day interaction history
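The three memory types listed above could be held in a small container like the one sketched here. The class name `UserMemory`, its field names, and the dictionary shapes are hypothetical; the only detail taken from the source is the 9-day window for interaction history.

```python
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    """Hypothetical container for the long-term, preference, and short-term memory."""
    long_term_profile: dict[str, str]
    tool_preferences: dict[str, str]  # keyed by tool category
    short_term_state: dict[str, str] = field(default_factory=dict)
    history: list[list[str]] = field(default_factory=list)  # one list per day

    def add_day(self, interactions: list[str], max_days: int = 9) -> None:
        """Append one day of interactions and keep only the last max_days days."""
        self.history.append(interactions)
        # Drop the oldest days once the window is exceeded.
        del self.history[:-max_days]
```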
Planning
- explicit keypoint planning (E-ReAct)
- one-shot example prompting for ReAct
Tool Use
- Function Calling (FC)
- ReAct
- E-ReAct
- tool discovery via search_tools/get_tool_doc
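To illustrate what tool discovery through `search_tools` and `get_tool_doc` could look like, the sketch below implements both over a toy in-memory registry. Only the two function names come from the source; their signatures, the keyword-overlap ranking, and the three example tools are assumptions for illustration and do not reflect the ETAPP sandbox's real interface.

```python
# Toy in-memory registry; tool names and docs are invented for illustration.
TOOL_DOCS = {
    "get_weather": "Return the weather forecast for a city and date.",
    "add_calendar_event": "Create a calendar event with a title and time.",
    "search_restaurants": "Find restaurants by cuisine and location.",
}


def search_tools(query: str, limit: int = 3) -> list[str]:
    """Rank tools by how many query words appear in their name or doc."""
    words = query.lower().split()
    scored = []
    for name, doc in TOOL_DOCS.items():
        text = f"{name} {doc}".lower()
        score = sum(word in text for word in words)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


def get_tool_doc(name: str) -> str:
    """Return the documentation string for a tool, or raise KeyError."""
    return TOOL_DOCS[name]
```

In a Tool-Retrieval setting the agent would call `search_tools` first and then read `get_tool_doc` for each candidate before invoking it, which adds steps where retrieval can go wrong.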
Frameworks
- ETAPP sandbox with 33 APIs
- key-point-based LLM evaluation
Is Agentic
true
Architectures
- personal tool-augmented LLM agent
- sandboxed API executor
Collaboration
- LLM as controller invoking external APIs
Optimization Features
Token Efficiency
- Supplying only the needed preferences, rather than all preferences, reduces token usage (2393 vs 3444 tokens)
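The size of that saving follows from simple arithmetic on the two token counts quoted above; the variable names are just labels for this calculation.

```python
needed_tokens = 2393  # input with only the needed preferences
all_tokens = 3444     # input with all preferences

saved = all_tokens - needed_tokens
relative_saving = saved / all_tokens
print(f"{saved} tokens saved ({relative_saving:.1%})")  # 1051 tokens saved (30.5%)
```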
System Optimization
- stable sandbox to isolate external variability
Training Optimization
- LoRA
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Does not cover multimodal tasks; focuses on API/text interactions only.
- Preference modeling is coarse: preferences for choosing among similar tools are not modeled.
- Reasoning models were evaluated only on subsets of the benchmark because of budget limits.
- Fine-tuning experiments use a small annotated set (200 instructions), limiting generality.
When Not To Use
- Not suitable as a sole test for multimodal agents or vision/voice workflows.
- Not a final safety clearance: it checks personalization and proactivity, not full safety or privacy risks.
Failure Modes
- Models answer directly without justifying tool choice or planning (weak proactivity).
- Tool retrieval or planning fails in longer, multi-step tasks.
- Automated evaluators without key points overestimate personalization and proactivity.
Core Entities
Models
- gpt-4o
- DeepSeek-V3
- Qwen2.5-72B-Instruct
- Llama-3.1-70B-Instruct
- watt-tool-70B
- o1-preview
- o1-mini
- DeepSeek-R1
- DeepSeek-R1-Distill-Qwen-32B
- QwQ-32B-Preview
- Qwen2.5-7B-Instruct (fine-tuned)
Metrics
- Procedure (PRC)
- Personalization (PSN)
- Proactivity (PTV)
Datasets
- ETAPP (this paper)
Benchmarks
- ETAPP

