PersonaLens: a large benchmark and LLM-based user+judge agents to measure personalization in task-oriented assistants

Overview

Decision SnapshotNeeds Validation

The benchmark is large, validated against humans, and provides reusable prompts and code; it is ready for evaluation but not a drop-in for full production personalization pipelines because profiles are semi-synthetic and actions are simulated.

Citations0

Evidence Strength0.80

Confidence0.86

Risk Signals11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 65%

Authors

Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B. Cohen, Emine Yilmaz

Links

Abstract / PDF / Code / Data

Why It Matters For Business

PersonaLens gives a practical, large-scale way to measure personalization separate from task success; use it to benchmark assistants, prioritize memory/retrieval investments, and avoid overestimating personalization from high task completion.

Who Should Care

Product Manager ML Engineer CTO Data Scientist Engineering Lead

Summary TLDR

PersonaLens is a large benchmark for measuring personalization in task-oriented conversational assistants. It provides 1,500 semi-synthetic user profiles, 111 tasks across 20 domains and 122,133 user-task scenarios, plus two LLM agents: a user agent to simulate realistic dialogues and a judge agent to score personalization, coherence and task success. Validation shows strong agreement between the judge and humans for task completion and coherence, but low personalization overall (typical scores ≈2/4). Interaction history (past dialogues) yields the largest personalization gains. The benchmark, prompts, and data are released for reproducible evaluation.

Problem Statement

Existing personalization tests either target chit-chat, non-conversational tasks, or narrow domains and do not capture multi-turn, task-oriented personalization. We need a scalable, realistic benchmark that measures how assistants use past interactions, preferences and situational context to personalize while completing tasks.

Main Contribution

PersonaLens: a benchmark with 1,500 user profiles, 111 tasks over 20 domains and 122,133 user-task scenarios.

Two LLM agents: a user agent that simulates multi-turn task dialogues and a judge agent that evaluates personalization, dialogue quality and task completion.

Key Findings

Large, multi-domain benchmark: 122,133 user-task scenarios from 1,500 profiles and 111 tasks across 20 domains.

Numbers122,133 scenarios; 1,500 profiles; 111 tasks; 20 domains

Practical UseUse this dataset to test assistants at scale across diverse tasks and realistic user histories.

Evidence RefSection 2; Table 2

Automated judge aligns well with human raters on task completion and coherence.

NumbersCohen's Kappa: TC 0.78; Coherence (A) 0.65

Practical UseYou can rely on the LLM-as-a-Judge to reduce expensive human annotation for TC and coherence checks.

Evidence RefTable 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TCR (T_SD) - Claude 3 Sonnet	95.98%	—	—	T_SD (single-domain)	High task completion for Claude 3 Sonnet on single-domain tasks	Table 3
Personalization (P) - Claude 3 Sonnet (T_SD)	2.13 / 4	—	—	T_SD	Low-to-moderate personalization despite high TCR	Table 3

What To Try In 7 Days

Run PersonaLens on your assistant on a small sample to measure current P vs TCR gaps.

Add retrieval of past interactions (or a simple conversation memory) and rerun the Base→Base+I ablation to estimate gains.

Use the judge prompts and evaluate a subset manually to confirm judge alignment with your users.

Agent Features

Memory

Past interaction summaries (queryable short-term memory)Situational context per task

Frameworks

LLM-as-a-JudgeSimulated user evaluation pipeline

Is Agentic

Yes

Architectures

LLM-powered user agentLLM-powered judge agent

Collaboration

User agent interacts with assistant; judge evaluates dialogue

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/amazon-science/PersonaLens

Data URLs

https://github.com/amazon-science/PersonaLens

Risks & Boundaries

Limitations

Profiles and dialogues are LLM-generated and may inherit LLM biases despite mitigation.

Only text-based interactions; no multimodal personalization (voice, images).

When Not To Use

For assessing real-world end-to-end action execution (payments, bookings) because actions are simulated.

For multimodal personalization tasks requiring audio or images.

Failure Modes

Judge bias: judge agent may reflect training data biases and over/under-rate personalization.

Synthetic-profile artifacts: generated preferences might not match true user behavior.

Core Entities

Models

Claude 3 HaikuClaude 3.5 HaikuClaude 3 SonnetClaude 3.5 SonnetLlama 3.1 8B InstructLlama 3.1 70B InstructMistral 7B InstructMixtral 8x7B Instruct

Metrics

Task Completion Rate (TCR)Personalization (1-4 scale)Naturalness (1-5)Coherence (1-5)

Datasets

PersonaLens (this paper)PRISM Alignment dataset

Benchmarks

PersonaChatLaMPMultiWOZSGDLAPS

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Large, multi-domain benchmark: 122,133 user-task scenarios from 1,500 profiles and 111 tasks across 20 domains.

Automated judge aligns well with human raters on task completion and coherence.

Results

What To Try In 7 Days

Agent Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Key finding

MEMERAG: a native multilingual benchmark to evaluate RAG outputs and LLM-based evaluators

Key finding