ETAPP: an 800-case sandbox benchmark and key-point LLM evaluator for personalized tool use

March 2, 2025 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao

Links

Abstract / PDF

Why It Matters For Business

ETAPP and the key-point evaluator reveal where assistants fail to tailor actions or anticipate needs. Use them to catch personalization and proactivity gaps before product launch.

Summary TLDR

This paper introduces ETAPP, a controlled sandbox benchmark (800 cases, 16 user profiles, 33 APIs) that measures two qualities for personal assistants: Personalization (tailoring to user preferences) and Proactivity (anticipating unstated needs). It also proposes a key-point-based LLM evaluation method that hands annotated task key points to the evaluating LLM to reduce bias and improve agreement with humans (mean diff 0.01; 89.6% of scores within 1 point). Experiments show strong models still struggle with proactive behavior and tool retrieval. Fine-tuning on a small set of annotated tool-invocation traces helps in-domain but gives limited out-of-domain gains. Code and sandbox artifacts are

Problem Statement

Existing benchmarks focus on text personalization or tool use separately. There is no standard, controlled test that measures how well an LLM both uses tools and adapts that use to a specific user's preferences while being proactive. Automatic LLM-based graders are also noisy for these dimensions.

Main Contribution

ETAPP benchmark and sandbox: 800 test cases from 16 user profiles, 33 functional APIs across 9 categories to evaluate personalization and proactivity.

Key-point-based LLM evaluation: provide human-written key points per case to the evaluator LLM to reduce scoring bias and increase alignment with human judgments.

Analysis of tool-invoking methods and fine-tuning: compare Function Calling (FC), ReAct, and E-ReAct; show that E-ReAct and reasoning traces improve personalization and proactivity; fine-tuning helps in-domain but yields limited out-of-domain (OOD) gains.

Key Findings

ETAPP dataset and sandbox provide a repeatable test for personalized tool invocation.

Numbers: 800 test cases; 16 user profiles; 33 APIs; 9 categories (Table 5)

Key-point-based LLM evaluation substantially improves alignment with human scoring.

Numbers: Mean difference 0.01; limits-of-agreement (LoA) width 1.89 vs 2.24; 89.6% of samples differ by ≤1 point
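
The agreement figures above can be reproduced on your own paired scores with a few lines of Python. This is a minimal sketch, assuming the LoA are the usual Bland-Altman 95% limits (mean difference ± 1.96 standard deviations); the summary does not state the exact formula, and the function name `agreement_stats` is illustrative only.

```python
from statistics import mean, stdev

def agreement_stats(llm_scores, human_scores):
    """Summarise evaluator-vs-human agreement on paired scores."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    # Signed difference per sample: positive means the LLM scored higher.
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    # Assumption: Bland-Altman 95% limits of agreement.
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    within_one = sum(abs(d) <= 1 for d in diffs) / len(diffs)
    return {
        "mean_diff": bias,
        "loa_lower": lower,
        "loa_upper": upper,
        "loa_width": upper - lower,
        "within_one": within_one,
    }

# Toy scores for illustration only; these are not values from the paper.
stats = agreement_stats([3, 4, 2, 5, 1], [3, 3, 2, 3, 2])
```

A mean difference near zero together with a narrower LoA width means the evaluator is both less biased and less erratic relative to human raters.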

Models score poorly on Proactivity and drop markedly when they must find tools.

Numbers: e.g., GPT-4o Proactivity (PTV) 1.61 (Tool-Given) → 1.08 (Tool-Retrieval); many models average PTV ≤1.87

Fine-tuning on annotated tool-invocation traces improves in-domain performance but less so for new scenarios.

Numbers: e.g., ReAct Procedure (PRC) rises 2.76 → 3.47 (↑25.8%); Proactivity gains reach up to ~77.7% in some splits

Results

Tool-Retrieval degrades tool use compared to Tool-Given

Value: e.g., GPT-4o PRC 3.95 → 2.67 (a drop of 1.28 points)

Baseline: Tool-Given

Key-point evaluator vs human agreement (Proactivity)

Value: Mean difference 0.01; 89.6% of samples within ±1 point

Baseline: Free evaluation (no key points)

Overall Proactivity scores are low across models

Value: Many models score PTV ≤ 1.87 (Tool-Given); e.g., GPT-4o PTV 1.61

Baseline: Tool-Given scores

Fine-tuning on reasoning traces helps in-domain

Value: e.g., ReAct PRC improves by 25.8% (2.76 → 3.47)

Baseline: Vanilla model before fine-tuning

Who Should Care

What To Try In 7 Days

Run a subset of ETAPP queries against your assistant to find tool-retrieval and proactivity gaps.

Add per-case key points to your automated evaluation pipeline to reduce scorer drift versus human raters.
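
One way to wire key points into a grader is to list them explicitly in the evaluation prompt and parse a numeric score from the reply. The sketch below is an assumption-laden illustration: the prompt wording, the function names `build_eval_prompt` and `parse_score`, and the 0 to 5 score range are all hypothetical and not taken from the paper.

```python
import re

def build_eval_prompt(query, response, key_points, dimension, max_score=5):
    """Assemble a grading prompt that lists the annotated key points."""
    if not key_points:
        raise ValueError("key-point evaluation needs at least one key point")
    numbered = "\n".join(f"{i}. {kp}" for i, kp in enumerate(key_points, start=1))
    # Assumption: a 0..max_score integer scale; adjust to your own rubric.
    return (
        f"You are grading an assistant's {dimension}.\n"
        f"User query:\n{query}\n\n"
        f"Assistant response and tool calls:\n{response}\n\n"
        f"Key points a good answer should cover:\n{numbered}\n\n"
        f"Check each key point, then reply with 'Score: <0-{max_score}>'."
    )

def parse_score(evaluator_output, max_score=5):
    """Extract the integer score; return None if absent or out of range."""
    match = re.search(r"Score:\s*(\d+)", evaluator_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= max_score else None

# Hypothetical case for illustration.
prompt = build_eval_prompt(
    "Plan my commute for tomorrow",
    "Called get_route(mode='car')",
    ["Avoid highways, per the user's profile", "Mention the rain forecast"],
    "Proactivity",
)
```

Returning `None` for unparseable or out-of-range replies lets the pipeline retry or flag the case instead of silently recording a bad score.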

Prototype an E-ReAct prompt that forces explicit planning (key personalization/proactivity points) before tool calls.
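
A starting point for such a prompt is a template that demands a written plan before the first tool call. This is a minimal sketch: the template wording and the helper `build_ereact_prompt` are hypothetical, and the paper's actual E-ReAct prompt may differ.

```python
# Hypothetical template; the wording is an assumption, not the paper's prompt.
EREACT_TEMPLATE = """You are a personal assistant with access to these tools:
{tool_list}

User profile and preferences:
{profile}

Before any tool call, write a Plan section with two lists:
- Personalization points: preferences from the profile that apply here.
- Proactivity points: needs the user did not state but is likely to have.

Then alternate Thought / Action / Observation steps, and finish with
Final Answer once every planned point is addressed or ruled out.

User request: {request}
Plan:"""

def build_ereact_prompt(tools, profile, request):
    """Fill the template; `tools` maps tool name -> one-line description."""
    tool_list = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return EREACT_TEMPLATE.format(
        tool_list=tool_list, profile=profile, request=request
    )

# Hypothetical tool and profile for illustration.
ereact_prompt = build_ereact_prompt(
    {"get_weather": "forecast lookup"},
    "Prefers cycling to work",
    "Plan tomorrow morning",
)
```

Ending the prompt on `Plan:` nudges the model to produce the planning section first, so the later tool calls can be checked against it.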

Agent Features

Memory

  • long-term user profile
  • tool-utilizing preferences (category-level)
  • short-term user state and 9-day interaction history

Planning

  • explicit keypoint planning (E-ReAct)
  • one-shot example prompting for ReAct

Tool Use

  • Function Calling (FC)
  • ReAct
  • E-ReAct
  • tool discovery via search_tools/get_tool_doc
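
In the Tool-Retrieval setting the model has to find tools before calling them. The sketch below mimics that two-step flow with an in-memory catalogue. Only the names `search_tools` and `get_tool_doc` come from the summary; their signatures, the naive word-overlap ranking, and the three example tools are assumptions made for illustration.

```python
# Hypothetical catalogue; names and docs are placeholders, not ETAPP's APIs.
TOOL_DOCS = {
    "get_weather": "Return the weather forecast for a city and date.",
    "add_calendar_event": "Create a calendar event with a title and start time.",
    "search_restaurants": "Find restaurants by cuisine and location.",
}

def _tokens(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())

def search_tools(query, top_k=3):
    """Rank tools by word overlap between the query and each tool's doc."""
    query_words = _tokens(query)
    scored = []
    for name, doc in TOOL_DOCS.items():
        overlap = len(query_words & _tokens(doc))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]

def get_tool_doc(name):
    """Return the documentation string for one tool, or raise KeyError."""
    return TOOL_DOCS[name]

candidates = search_tools("weather forecast tomorrow")
docs = [get_tool_doc(name) for name in candidates]
```

A real retriever would likely use embeddings or stop-word filtering; the point here is only the search-then-read-docs sequence the agent must carry out before it can call a tool.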

Frameworks

  • ETAPP sandbox with 33 APIs
  • key-point-based LLM evaluation

Is Agentic

true

Architectures

  • personal tool-augmented LLM agent
  • sandboxed API executor

Collaboration

  • LLM as controller invoking external APIs

Optimization Features

Token Efficiency

  • Supplying only the needed preferences instead of all preferences reduces token usage (2393 vs 3444 tokens)
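
The reported token counts correspond to a saving of about 30.5%. The sketch below shows one plausible way to build the smaller input, assuming "needed" means the category-level preferences relevant to the tools in play; the data layout and both function names are hypothetical.

```python
def select_needed_preferences(preferences, tool_categories):
    """Keep only preferences for categories the candidate tools belong to."""
    needed = set(tool_categories)
    return {cat: pref for cat, pref in preferences.items() if cat in needed}

def relative_saving(tokens_all, tokens_needed):
    """Fraction of prompt tokens saved by the smaller input."""
    return (tokens_all - tokens_needed) / tokens_all

# Hypothetical category-level preferences for one user profile.
preferences = {
    "shopping": "prefers budget brands",
    "travel": "prefers trains over flights",
    "health": "tracks sleep daily",
}
needed = select_needed_preferences(preferences, ["travel"])

# Token counts reported in the summary: 3444 (All) vs 2393 (Needed).
saving = relative_saving(3444, 2393)
```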

System Optimization

  • stable sandbox to isolate external variability

Training Optimization

  • LoRA

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not cover multimodal tasks; focuses on API/text interactions only.
  • Preference modeling is coarse: preferences for choosing among similar tools are not modeled.
  • Evaluation of reasoning models was limited by budget; some models were tested only on subsets.
  • Fine-tuning experiments use a small annotated set (200 instructions), limiting generality.

When Not To Use

  • Not suitable as a sole test for multimodal agents or vision/voice workflows.
  • Not a final safety clearance: it checks personalization and proactivity, not full safety or privacy risks.

Failure Modes

  • Models answer directly without justifying tool choice or planning (weak proactivity).
  • Tool retrieval or planning fails in longer, multi-step tasks.
  • Automated evaluators without key points overestimate personalization and proactivity.

Core Entities

Models

  • GPT-4o
  • DeepSeek-V3
  • Qwen2.5-72B-Instruct
  • Llama-3.1-70B-Instruct
  • watt-tool-70B
  • o1-preview
  • o1-mini
  • DeepSeek-R1
  • DeepSeek-R1-Distill-Qwen-32B
  • QwQ-32B-Preview
  • Qwen2.5-7B-Instruct (fine-tuned)

Metrics

  • Procedure (PRC)
  • Personalization (PSN)
  • Proactivity (PTV)

Datasets

  • ETAPP (this paper)

Benchmarks

  • ETAPP