Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
PersonaLens gives a practical, large-scale way to measure personalization separate from task success; use it to benchmark assistants, prioritize memory/retrieval investments, and avoid overestimating personalization from high task completion.
Summary TLDR
PersonaLens is a large benchmark for measuring personalization in task-oriented conversational assistants. It provides 1,500 semi-synthetic user profiles, 111 tasks across 20 domains and 122,133 user-task scenarios, plus two LLM agents: a user agent to simulate realistic dialogues and a judge agent to score personalization, coherence and task success. Validation shows strong agreement between the judge and humans for task completion and coherence, but low personalization overall (typical scores ≈2/4). Interaction history (past dialogues) yields the largest personalization gains. The benchmark, prompts, and data are released for reproducible evaluation.
Problem Statement
Existing personalization tests either target chit-chat, non-conversational tasks, or narrow domains and do not capture multi-turn, task-oriented personalization. We need a scalable, realistic benchmark that measures how assistants use past interactions, preferences and situational context to personalize while completing tasks.
Main Contribution
PersonaLens: a benchmark with 1,500 user profiles, 111 tasks over 20 domains and 122,133 user-task scenarios.
Two LLM agents: a user agent that simulates multi-turn task dialogues and a judge agent that evaluates personalization, dialogue quality and task completion.
Validation showing high judge-vs-human agreement for task completion and coherence, and analyses of model scaling, contextual value, and cross-domain personalization patterns.
Public release of dataset, prompts, and evaluation code to enable reproducible personalization evaluation.
Key Findings
Large, multi-domain benchmark: 122,133 user-task scenarios from 1,500 profiles and 111 tasks across 20 domains.
Automated judge aligns well with human raters on task completion and coherence.
Personalization scores are low on average; most assistants cluster around basic personalization.
High task completion rates mask weak personalization trade-offs.
Interaction history drives the largest personalization gains.
Multi-domain tasks reduce both task completion and personalization.
Results
TCR (T_SD) - Claude 3 Sonnet
Personalization (P) - Claude 3 Sonnet (T_SD)
Ablation: add past interactions (Base → Base+I)
Judge vs Human agreement (Task Completion)
Who Should Care
What To Try In 7 Days
Run PersonaLens on your assistant on a small sample to measure current P vs TCR gaps.
Add retrieval of past interactions (or a simple conversation memory) and rerun the Base→Base+I ablation to estimate gains.
Use the judge prompts and evaluate a subset manually to confirm judge alignment with your users.
Agent Features
Memory
- Past interaction summaries (queryable short-term memory)
- Situational context per task
Frameworks
- LLM-as-a-Judge
- Simulated user evaluation pipeline
Is Agentic
true
Architectures
- LLM-powered user agent
- LLM-powered judge agent
Collaboration
- User agent interacts with assistant; judge evaluates dialogue
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Profiles and dialogues are LLM-generated and may inherit LLM biases despite mitigation.
- Only text-based interactions; no multimodal personalization (voice, images).
- Evaluations simulate actions (bookings, purchases) rather than executing them in real systems.
- Some niche domains may need further domain-specific customization.
When Not To Use
- For assessing real-world end-to-end action execution (payments, bookings) because actions are simulated.
- For multimodal personalization tasks requiring audio or images.
- If you need purely human-collected dialogue data without LLM generation.
Failure Modes
- Judge bias: judge agent may reflect training data biases and over/under-rate personalization.
- Synthetic-profile artifacts: generated preferences might not match true user behavior.
- Cross-domain conflicts: multi-domain tasks reveal inconsistent preference application.
- Overfitting to prompts: user agent behavior depends on prompt choices and temperature.
Core Entities
Models
- Claude 3 Haiku
- Claude 3.5 Haiku
- Claude 3 Sonnet
- Claude 3.5 Sonnet
- Llama 3.1 8B Instruct
- Llama 3.1 70B Instruct
- Mistral 7B Instruct
- Mixtral 8x7B Instruct
Metrics
- Task Completion Rate (TCR)
- Personalization (1-4 scale)
- Naturalness (1-5)
- Coherence (1-5)
Datasets
- PersonaLens (this paper)
- PRISM Alignment dataset
Benchmarks
- PersonaChat
- LaMP
- MultiWOZ
- SGD
- LAPS

