Overview
The benchmark is large, validated against human raters, and ships reusable prompts and code. It is ready for evaluation use but is not a drop-in for full production personalization pipelines, because profiles are semi-synthetic and actions are simulated.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
PersonaLens gives a practical, large-scale way to measure personalization separately from task success. Use it to benchmark assistants, prioritize memory/retrieval investments, and avoid inferring strong personalization from high task completion.
Summary TLDR
PersonaLens is a large benchmark for measuring personalization in task-oriented conversational assistants. It provides 1,500 semi-synthetic user profiles, 111 tasks across 20 domains, and 122,133 user-task scenarios, plus two LLM agents: a user agent that simulates realistic dialogues and a judge agent that scores personalization, coherence, and task success. Validation shows strong agreement between the judge and human raters on task completion and coherence. The evaluated assistants personalize poorly overall (typical personalization scores ≈2/4), even when task completion is high. Adding interaction history (past dialogues) yields the largest personalization gains. The benchmark, prompts, and data are released for reproducible evaluation.
Problem Statement
Existing personalization tests target chit-chat, non-conversational tasks, or narrow domains, and do not capture multi-turn, task-oriented personalization. We need a scalable, realistic benchmark that measures how assistants use past interactions, preferences, and situational context to personalize while completing tasks.
Main Contribution
PersonaLens: a benchmark with 1,500 user profiles, 111 tasks over 20 domains and 122,133 user-task scenarios.
Two LLM agents: a user agent that simulates multi-turn task dialogues and a judge agent that evaluates personalization, dialogue quality and task completion.
Key Findings
Large, multi-domain benchmark: 122,133 user-task scenarios from 1,500 profiles and 111 tasks across 20 domains.
Automated judge aligns well with human raters on task completion and coherence.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Task completion rate (TCR), T_SD - Claude 3 Sonnet | 95.98% | — | — | T_SD (single-domain) | High task completion for Claude 3 Sonnet on single-domain tasks | Table 3 |
| Personalization (P) - Claude 3 Sonnet (T_SD) | 2.13 / 4 | — | — | T_SD | Low-to-moderate personalization despite high TCR | Table 3 |
What To Try In 7 Days
Run PersonaLens against your assistant on a small sample of scenarios to measure the current gap between personalization (P) and task completion rate (TCR).
Add retrieval of past interactions (or a simple conversation memory) and rerun the Base→Base+I ablation to estimate gains.
Use the judge prompts and evaluate a subset manually to confirm judge alignment with your users.
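The first two steps above can be sketched as follows. This is a minimal illustration, not the benchmark's own tooling: the `JudgedDialogue` record, its field names, and the helper functions are hypothetical, and it assumes you have already exported one judge verdict per dialogue (a task-completion flag and a personalization score on the scale that tops out at 4).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgedDialogue:
    """One judged dialogue (hypothetical record, not a PersonaLens type)."""
    task_completed: bool      # judge verdict on task success
    personalization: float    # judge personalization score (max 4 assumed)


def summarize(dialogues):
    """Return task completion rate and mean personalization for a run."""
    if not dialogues:
        raise ValueError("need at least one judged dialogue")
    tcr = mean(1.0 if d.task_completed else 0.0 for d in dialogues)
    mean_p = mean(d.personalization for d in dialogues)
    return {"tcr": tcr, "mean_p": mean_p}


def ablation_delta(base_run, history_run):
    """Change in mean personalization when interaction history is added."""
    return summarize(history_run)["mean_p"] - summarize(base_run)["mean_p"]


# Toy numbers for illustration only; they are not benchmark results.
base_run = [
    JudgedDialogue(True, 2.0),
    JudgedDialogue(True, 2.0),
    JudgedDialogue(False, 1.0),
    JudgedDialogue(True, 3.0),
]
history_run = [
    JudgedDialogue(True, 3.0),
    JudgedDialogue(True, 3.0),
    JudgedDialogue(True, 2.0),
    JudgedDialogue(False, 2.0),
]

base_summary = summarize(base_run)
delta = ablation_delta(base_run, history_run)
print(base_summary, delta)
```

Reporting TCR and mean P side by side, rather than folding them into one number, keeps the gap visible: a run can complete most tasks while still scoring low on personalization.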
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Profiles and dialogues are LLM-generated and may inherit LLM biases despite mitigation.
Only text-based interactions; no multimodal personalization (voice, images).
When Not To Use
For assessing real-world end-to-end action execution (payments, bookings) because actions are simulated.
For multimodal personalization tasks requiring audio or images.
Failure Modes
Judge bias: the judge agent may reflect training data biases and over- or under-rate personalization.
Synthetic-profile artifacts: generated preferences might not match true user behavior.
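One way to probe the judge-bias failure mode is to compare judge scores with human ratings on the same dialogues. The sketch below is an assumption-laden illustration: the function name `judge_bias` and the paired-list input format are hypothetical, and the three statistics are generic agreement measures, not the validation procedure used by the benchmark authors.

```python
from statistics import mean


def judge_bias(judge_scores, human_scores):
    """Compare paired judge and human scores for the same dialogues.

    A positive mean signed difference suggests the judge over-rates
    relative to humans; a negative one suggests it under-rates.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    if not judge_scores:
        raise ValueError("need at least one scored dialogue")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "mean_signed_diff": mean(diffs),
        "mean_abs_diff": mean(abs(d) for d in diffs),
        "exact_agreement": mean(1.0 if d == 0 else 0.0 for d in diffs),
    }


# Toy ratings for illustration only; they are not benchmark data.
report = judge_bias(judge_scores=[3, 2, 4, 2], human_scores=[2, 2, 3, 3])
print(report)
```

A small manually rated subset is enough to see the direction of any bias; a chance-corrected agreement statistic would be a natural next step on a larger sample.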

