PersonaLens: a large benchmark and LLM-based user+judge agents to measure personalization in task-oriented assistants

June 11, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.5

Citation Count

0

Authors

Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B. Cohen, Emine Yilmaz

Why It Matters For Business

PersonaLens gives a practical, large-scale way to measure personalization separate from task success; use it to benchmark assistants, prioritize memory/retrieval investments, and avoid overestimating personalization from high task completion.

Summary TLDR

PersonaLens is a large benchmark for measuring personalization in task-oriented conversational assistants. It provides 1,500 semi-synthetic user profiles, 111 tasks across 20 domains and 122,133 user-task scenarios, plus two LLM agents: a user agent to simulate realistic dialogues and a judge agent to score personalization, coherence and task success. Validation shows strong agreement between the judge and human raters on task completion and coherence. Across the assistants tested, personalization is low overall (typical scores ≈2/4), even when task completion is high. Interaction history (past dialogues) yields the largest personalization gains. The benchmark, prompts, and data are released for reproducible evaluation.

Problem Statement

Existing personalization tests target chit-chat, non-conversational tasks, or narrow domains, and do not capture multi-turn, task-oriented personalization. We need a scalable, realistic benchmark that measures how assistants use past interactions, preferences and situational context to personalize while completing tasks.

Main Contribution

PersonaLens: a benchmark with 1,500 user profiles, 111 tasks over 20 domains and 122,133 user-task scenarios.

Two LLM agents: a user agent that simulates multi-turn task dialogues and a judge agent that evaluates personalization, dialogue quality and task completion.

Validation showing high judge-vs-human agreement for task completion and coherence, and analyses of model scaling, contextual value, and cross-domain personalization patterns.

Public release of dataset, prompts, and evaluation code to enable reproducible personalization evaluation.

Key Findings

Large, multi-domain benchmark: 122,133 user-task scenarios from 1,500 profiles and 111 tasks across 20 domains.

Numbers: 122,133 scenarios; 1,500 profiles; 111 tasks; 20 domains

Automated judge aligns well with human raters on task completion and coherence.

Numbers: Cohen's Kappa: TC 0.78; Coherence (A) 0.65

Personalization scores are low on average; most assistants cluster around basic personalization.

Numbers: Personalization mostly ≈2/4 across assistants in experiments

High task completion rates can mask weak personalization.

Numbers: Claude 3 Sonnet reaches TCR 95.98% (T_SD) while P ≈2.13/4

Interaction history drives the largest personalization gains.

Numbers: Adding past interactions raised P from 2.13 → 2.59 (T_SD)

Multi-domain tasks reduce both task completion and personalization.

Numbers: TCR and P drop from single-domain to multi-domain tasks across models (example: Claude 3 Haiku P 2.20 → 1.98)

Results

TCR (T_SD) - Claude 3 Sonnet

Value: 95.98%

Personalization (P) - Claude 3 Sonnet (T_SD)

Value: 2.13 / 4

Ablation: add past interactions (Base → Base+I)

Value: P 2.13 → 2.59 (T_SD); TCR 95.98% → 96.83%

Baseline: Base
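
To make the size of the Base → Base+I ablation concrete, the reported numbers can be turned into absolute and relative changes. The sketch below uses only the values quoted above for Claude 3 Sonnet (T_SD); the function name `gains` and the dictionary layout are illustrative assumptions, not part of the PersonaLens release.

```python
# Values reported above for Claude 3 Sonnet in the T_SD setting.
base = {"P": 2.13, "TCR": 95.98}          # Base
base_plus_i = {"P": 2.59, "TCR": 96.83}   # Base + past interactions

def gains(before, after):
    """Return the absolute and relative (percent) change for each metric."""
    out = {}
    for name in before:
        delta = after[name] - before[name]
        out[name] = {
            "abs": round(delta, 2),
            "rel_pct": round(100 * delta / before[name], 1),
        }
    return out

ablation_gains = gains(base, base_plus_i)
print(ablation_gains)
```

On these numbers, P rises by 0.46 points (about 21.6% relative to Base), while TCR moves by only 0.85 percentage points, so the ablation mainly affects personalization rather than task completion.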

Judge vs Human agreement (Task Completion)

Value: Cohen's Kappa 0.78; IAA 0.865
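
For readers unfamiliar with the agreement statistic, here is a minimal sketch of Cohen's kappa for two raters, written from the standard definition (observed agreement corrected for chance agreement). The `judge` and `human` label lists are toy data invented for illustration; they are not drawn from the paper and do not reproduce its 0.78 figure.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Fraction of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy task-completion labels (1 = completed, 0 = not completed).
judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
kappa = cohens_kappa(judge, human)
print(round(kappa, 3))
```

In this toy example the raters agree on 8 of 10 items, but chance agreement is 0.58, so kappa is noticeably lower than the raw agreement rate.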

What To Try In 7 Days

Run PersonaLens on your assistant on a small sample to measure current P vs TCR gaps.

Add retrieval of past interactions (or a simple conversation memory) and rerun the Base→Base+I ablation to estimate gains.

Use the judge prompts and evaluate a subset manually to confirm judge alignment with your users.
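
The first step above, measuring the gap between P and TCR on a small sample, can be sketched as follows. The record format (`task_completed`, `personalization`) and the "completed but weakly personalized" cutoff of P ≤ 2 are assumptions made for this example; the actual output format of the PersonaLens judge may differ.

```python
from statistics import mean

def summarize(records):
    """Summarize judged dialogues: TCR, mean P, and share of weak-but-completed runs."""
    if not records:
        raise ValueError("no judged dialogues to summarize")
    n = len(records)
    tcr_pct = 100 * sum(r["task_completed"] for r in records) / n
    mean_p = mean(r["personalization"] for r in records)
    # Dialogues that finished the task but scored at or below the assumed cutoff.
    weak = [r for r in records if r["task_completed"] and r["personalization"] <= 2]
    return {
        "tcr_pct": tcr_pct,
        "mean_p": mean_p,
        "completed_but_weak_pct": 100 * len(weak) / n,
    }

# Hypothetical judged sample; personalization is on the 1-4 scale.
sample = [
    {"task_completed": True, "personalization": 2},
    {"task_completed": True, "personalization": 3},
    {"task_completed": True, "personalization": 1},
    {"task_completed": False, "personalization": 2},
]
summary = summarize(sample)
print(summary)
```

Running the same summary before and after adding a conversation memory gives a rough local version of the Base → Base+I comparison.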

Agent Features

Memory

  • Past interaction summaries (queryable short-term memory)
  • Situational context per task

Frameworks

  • LLM-as-a-Judge
  • Simulated user evaluation pipeline

Is Agentic

true

Architectures

  • LLM-powered user agent
  • LLM-powered judge agent

Collaboration

  • User agent interacts with assistant; judge evaluates dialogue
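
The collaboration pattern above can be sketched as a simple loop: the user agent and the assistant alternate turns, and the judge sees the finished dialogue. Everything here is a hypothetical stand-in: `run_episode`, the three callables, and the toy stubs are assumptions for illustration, and in practice each callable would wrap an LLM call using the released prompts.

```python
from typing import Callable, Dict, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, text)

def run_episode(
    user_agent: Callable[[List[Turn]], Optional[str]],
    assistant: Callable[[List[Turn]], str],
    judge: Callable[[List[Turn]], Dict],
    max_turns: int = 10,
):
    """Alternate user and assistant turns, then hand the dialogue to the judge."""
    dialogue: List[Turn] = []
    for _ in range(max_turns):
        user_msg = user_agent(dialogue)
        if user_msg is None:  # user agent signals that the task is over
            break
        dialogue.append(("user", user_msg))
        dialogue.append(("assistant", assistant(dialogue)))
    return dialogue, judge(dialogue)

# Toy stubs standing in for LLM-backed agents.
def toy_user(dialogue):
    script = ["Book me a table for two tonight.", "Yes, Italian is fine."]
    turns_taken = sum(1 for speaker, _ in dialogue if speaker == "user")
    return script[turns_taken] if turns_taken < len(script) else None

def toy_assistant(dialogue):
    return f"(assistant reply to: {dialogue[-1][1]})"

def toy_judge(dialogue):
    # A real judge would return personalization, quality, and completion scores.
    return {"num_turns": len(dialogue)}

dialogue, verdict = run_episode(toy_user, toy_assistant, toy_judge)
print(verdict)
```

Keeping the judge outside the turn loop mirrors the description above, where evaluation happens on the dialogue rather than during it.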

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Profiles and dialogues are LLM-generated and may inherit LLM biases despite mitigation.
  • Only text-based interactions; no multimodal personalization (voice, images).
  • Evaluations simulate actions (bookings, purchases) rather than executing them in real systems.
  • Some niche domains may need further domain-specific customization.

When Not To Use

  • For assessing real-world end-to-end action execution (payments, bookings) because actions are simulated.
  • For multimodal personalization tasks requiring audio or images.
  • If you need purely human-collected dialogue data without LLM generation.

Failure Modes

  • Judge bias: judge agent may reflect training data biases and over/under-rate personalization.
  • Synthetic-profile artifacts: generated preferences might not match true user behavior.
  • Cross-domain conflicts: multi-domain tasks reveal inconsistent preference application.
  • Overfitting to prompts: user agent behavior depends on prompt choices and temperature.

Core Entities

Models

  • Claude 3 Haiku
  • Claude 3.5 Haiku
  • Claude 3 Sonnet
  • Claude 3.5 Sonnet
  • Llama 3.1 8B Instruct
  • Llama 3.1 70B Instruct
  • Mistral 7B Instruct
  • Mixtral 8x7B Instruct

Metrics

  • Task Completion Rate (TCR)
  • Personalization (1-4 scale)
  • Naturalness (1-5)
  • Coherence (1-5)

Datasets

  • PersonaLens (this paper)
  • PRISM Alignment dataset

Benchmarks

  • PersonaChat
  • LaMP
  • MultiWOZ
  • SGD
  • LAPS