Clear taxonomy and practical survey of persona use in LLMs: role-playing vs personalization

June 3, 20247 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

3

Authors

Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, Yun-Nung Chen

Links

Abstract / PDF

Why It Matters For Business

Personas let LLMs act like domain experts or tune results to users; use prompt personas to quickly prototype role-based workflows and invest in privacy-safe personalization for customer retention.

Summary TLDR

This survey organizes work that uses "persona" with large language models into two clear streams: (1) role-playing, where the model is given a persona or role to act as; and (2) personalization, where the model encodes and uses a user's persona to tailor outputs. It catalogs environments (software development, games, medical, evaluation), methods (prompt personas, multi-agent frameworks, retrieval for long histories), evaluation tools (Big Five, MBTI, MPI, LLM-as-evaluator), and open problems (long-context memory, dataset gaps, bias, safety, privacy). The paper provides a practical map, representative systems, and future directions.

Problem Statement

Research on using personas with LLMs is growing but fragmented. Practitioners lack a unifying taxonomy, a clear mapping from use cases to methods, and a concise view of evaluation practices and safety/privacy gaps.

Main Contribution

A two-part taxonomy: LLM Role-Playing (model has persona) vs LLM Personalization (model adapts to user persona).

A review of environments and methods: prompts, multi-agent agents, retrieval-memory, and fine-tuning approaches.

A summary of evaluation approaches for personality fidelity, including Big Five, MBTI, MPI, and LLM-as-evaluator.

A compact list of challenges and future directions: general frameworks, long-context personas, datasets, bias, safety, and privacy.

A maintained paper collection and code pointer for ongoing updates (GitHub).

Key Findings

The field splits into two distinct goals: role-playing and personalization.

Numbers2 research lines

Role-playing is often effective with prompt-based, training-free methods.

Multi-agent role-playing enables complex, collaborative workflows like software development and medical reasoning.

Personality evaluation commonly uses human psychometric tests (Big Five, MBTI) and specialized inventories (MPI).

Key engineering bottlenecks are long-context persona storage, lack of benchmarks/datasets, bias, safety, and privacy risks.

Using LLMs as evaluators is growing and can correlate better with humans than traditional metrics on some tasks.

Who Should Care

What To Try In 7 Days

Prototype a persona prompt for a concrete role (support agent, medical reviewer) and test outputs.

Build a simple 2–3 agent pipeline (planner + worker + reviewer) for a multi-step task.

Run a Big Five quick test on role-play outputs and compare to expected traits with a small human panel.

Agent Features

Memory

  • retrieval-based memory
  • short-term context summarization
  • long-term memory (via storage/summaries)

Planning

  • task decomposition
  • Waterfall-like phase pipelines
  • self-collaboration

Tool Use

  • retrieval memory
  • web navigation
  • external knowledge/tools

Frameworks

  • ChatDev
  • MetaGPT
  • AgentVerse
  • Voyager
  • MedAgent
  • DR-CoT
  • OPENCHA
  • MALP
  • HEALTHLLM

Is Agentic

true

Architectures

  • single-agent
  • multi-agent
  • agent pipeline

Collaboration

  • cooperative
  • adversarial
  • message pools

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Heterogeneous metrics across subfields make direct comparisons hard.
  • Many tasks lack standardized datasets or format-specific benchmarks.
  • Evaluations via human psychometrics may not directly transfer to LLMs.
  • Survey summarizes literature but does not run unified empirical comparisons.

When Not To Use

  • When legal privacy constraints forbid storing user persona data in prompts or memory.
  • When strict safety or non-toxicity guarantees are required without additional safeguards.
  • When you need a single, reproducible benchmarked model result (survey lacks unified benchmarks).

Failure Modes

  • Bias amplification when assigning demographic personas.
  • Increased toxicity or harmful outputs under certain persona prompts.
  • Jailbreaking via persona modulation and multi-agent coordination.
  • Personal data leakage via membership inference when storing personas.
  • Persona inconsistency across turns or sessions (unstable persona fidelity)

Core Entities

Models

  • ChatGPT
  • Voyager
  • MetaGPT
  • ChatDev
  • AgentVerse
  • DR-CoT
  • MedAgent
  • OPENCHA
  • MALP
  • HEALTHLLM

Metrics

  • Accuracy
  • task success rate
  • inform & success rate
  • Big Five (BFI)
  • MBTI
  • Machine Personality Inventory (MPI)

Datasets

  • WebShop
  • Mind2Web
  • WebArena
  • VisualWebArena
  • VisualWebBench
  • Amazon Review
  • MovieLens
  • Yelp
  • TripAdvisor
  • MIND
  • MultiWOZ
  • PersonaChat

Benchmarks

  • WebShop
  • Mind2Web
  • WebArena
  • VisualWebArena
  • VisualWebBench