Assigning demographic personas to LLM agents can change decisions and cut task success by up to 26%

Overview

Decision Snapshot: Needs Validation

Empirical evidence across three models and five benchmarks shows that persona prompts can change agent decisions; the results are robust for the tested settings but do not cover all agents or environments.

Citations: 0

Evidence Strength: 0.72

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Linbo Cao, Lihao Sun, Yang Yue

Links

Abstract / PDF

Why It Matters For Business

Persona prompts—even if harmless-sounding—can change agent decisions and reduce task success; this creates safety, fairness, and reliability risks for production agents.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This case study shows that giving LLM agents demographic personas (gender, race/origin, religion, profession) changes how they act and can degrade task performance. Across three models and five agent benchmarks, persona prompts produced consistent performance shifts: small changes (2–5%) on technical tasks but large drops up to 26.2% on strategic planning. The effect varies by persona, task type, and model, exposing a robustness and fairness risk when agents take actions in the world.

Problem Statement

Do persona prompts—short role-assignment prefixes that are irrelevant to the task—affect LLM agents' ability to perform multi-step, action-based tasks? The paper tests whether demographic personas change agent decisions and measurably degrade task outcomes across models and benchmarks.
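To make the setup concrete, here is a minimal sketch of a persona prompt as a role-assignment prefix. The template wording, the function name `with_persona`, and the sample task are assumptions for illustration; this summary does not give the paper's exact prompt text.

```python
from typing import Optional

# Hypothetical template: the paper's exact persona wording is not given here.
PERSONA_TEMPLATE = "You are {persona}. "


def with_persona(task_prompt: str, persona: Optional[str] = None) -> str:
    """Return the task prompt, optionally prefixed with a role assignment."""
    if persona is None:
        # Baseline condition: the task prompt is passed through unchanged.
        return task_prompt
    return PERSONA_TEMPLATE.format(persona=persona) + task_prompt


# Illustrative task text, not taken from any benchmark.
task = "Find the mug and put it on the desk."
baseline_prompt = with_persona(task)
persona_prompt = with_persona(task, "a person from Africa")
print(baseline_prompt)
print(persona_prompt)
```

The only difference between the two conditions is the prefix, which carries no information about the task itself.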

Main Contribution

First systematic case study linking demographic persona prompts to performance changes in action-taking LLM agents.

Evaluation across 23 personas, 3 widely used models, and 5 agentic benchmarks showing consistent persona-induced volatility.

Key Findings

Personas can cause large performance drops on strategic tasks.

Numbers: Card Game drop up to 26.2% (DeepSeek V3, 'from Africa')

Practical Use: Avoid unvalidated persona conditioning for agents used in planning or decision-making; test persona effects before deployment.

Evidence Ref: Table 2; Persona Category Analysis (Race/Origin Effects)

Racial and origin personas often drive the largest degradations.

Numbers: Multiple models show ≥11% drops; GPT-4o-mini up to 19% under racial cues

Practical Use: Treat race/origin role prompts as high-risk inputs and block or neutralize them in production agent prompts.

Evidence Ref: Results; Table 2 (race/origin rows)
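One way to neutralize role prompts before they reach an agent is to strip a leading role-assignment sentence and log what was removed. The sketch below is a heuristic under stated assumptions: the pattern `ROLE_PREFIX` and the function `neutralize` are hypothetical, the regex only recognizes a leading "You are ..." sentence, and it does not distinguish demographic roles from legitimate ones.

```python
import re
from typing import Optional, Tuple

# Hypothetical heuristic: a leading "You are ..." sentence counts as a role prefix.
ROLE_PREFIX = re.compile(r"^\s*you are\b[^.\n]*\.\s*", re.IGNORECASE)


def neutralize(prompt: str) -> Tuple[str, Optional[str]]:
    """Strip a leading role-assignment sentence.

    Returns the cleaned prompt and the removed prefix (None if nothing matched).
    """
    match = ROLE_PREFIX.match(prompt)
    if match is None:
        return prompt, None
    return prompt[match.end():], match.group(0).strip()


cleaned, removed = neutralize("You are a person from Africa. Play the card game.")
print(cleaned)
print(removed)
```

Because the pattern is broad, it would also strip a task-relevant role or an instruction such as "You are given a table.", so a production filter would need a reviewed allowlist or a narrower rule.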

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	DeepSeek V3 drops from 61.7% to 45.5% under some personas	61.7%	−26.2% (relative)	Card Game	Table 2; Persona Category Analysis	Table 2
ALFWorld success rate	Up to 14% relative shift across personas and models	varies by model (example: 52.0% baseline shown)	±14% (relative)	ALFWorld	Results; Impact on Agent Robustness	Table 2
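The Card Game delta reads as a relative change rather than a difference in percentage points. A short check of the arithmetic, using only the two endpoints in the table (the helper name `relative_change` is an assumption):

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed change relative to the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Card Game accuracy endpoints as reported in the table (rounded values).
baseline_score, persona_score = 61.7, 45.5

absolute_points = persona_score - baseline_score
relative_percent = relative_change(baseline_score, persona_score)
print(f"absolute: {absolute_points:+.1f} points")
print(f"relative: {relative_percent:+.1f}%")
```

The absolute gap is about 16 percentage points, while the relative drop is roughly 26%, which matches the reported figure up to rounding of the published endpoints.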

What To Try In 7 Days

Audit deployed agent prompts: remove or neutralize demographic role prefixes.

Run a smoke test: evaluate agent performance with and without a small set of personas on critical tasks.

Add a persona-sensitivity check to CI: fail fast if role prompts change key metrics beyond a threshold.
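The smoke test and CI check above could be sketched as follows. Everything here is an assumption for illustration: `persona_sensitivity` is a hypothetical helper, the 5% threshold is arbitrary, and `fake_evaluate` stands in for a real benchmark run with made-up scores.

```python
from typing import Callable, Dict, List, Optional


def persona_sensitivity(
    evaluate: Callable[[Optional[str]], float],
    personas: List[str],
    max_relative_shift: float = 0.05,  # assumed threshold, tune per metric
) -> Dict[str, float]:
    """Return personas whose relative metric shift exceeds the threshold."""
    baseline = evaluate(None)  # None means "no persona prefix"
    if baseline == 0:
        raise ValueError("baseline metric is zero; relative shift is undefined")
    violations: Dict[str, float] = {}
    for persona in personas:
        shift = (evaluate(persona) - baseline) / baseline
        if abs(shift) > max_relative_shift:
            violations[persona] = shift
    return violations


def fake_evaluate(persona: Optional[str]) -> float:
    """Stand-in for a real evaluation run; the scores are made up."""
    scores = {None: 0.60, "persona A": 0.59, "persona B": 0.48}
    return scores[persona]


violations = persona_sensitivity(fake_evaluate, ["persona A", "persona B"])
for persona, shift in violations.items():
    print(f"{persona}: relative shift {shift:+.1%}")
```

In a CI job, a non-empty `violations` dict would typically translate into a non-zero exit code so the pipeline fails fast. Real agent evaluations are noisy, so the threshold should sit above run-to-run variance, ideally estimated from repeated baseline runs.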

Agent Features

Planning

multi-step planning, strategic reasoning

Tool Use

OS command execution, SQL generation, web interaction (e-commerce)

Is Agentic

Yes

Architectures

LLM-based agent

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluations cover three models and five benchmarks, not the full space of models or environments.

The persona set covers 23 roles and may miss other culturally specific identities.

When Not To Use

Do not generalize results to every LLM or multi-agent system without testing.

Avoid using these findings to claim universal harms for untested tasks or populations.

Failure Modes

Task-specific performance drops driven by irrelevant persona cues.

Cross-model divergence where the same persona helps one model but harms another.
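Cross-model divergence can be flagged mechanically once per-persona shifts are available for each model. The sketch below assumes a nested dict of relative shifts keyed by model and persona; the function name `divergent_personas`, the model names, and all numbers are made up for illustration.

```python
from typing import Dict, List


def divergent_personas(shifts: Dict[str, Dict[str, float]]) -> List[str]:
    """Personas that raise the metric on one model and lower it on another."""
    personas = set()
    for per_model in shifts.values():
        personas.update(per_model)
    result = []
    for persona in sorted(personas):
        values = [m[persona] for m in shifts.values() if persona in m]
        if any(v > 0 for v in values) and any(v < 0 for v in values):
            result.append(persona)
    return result


# Made-up relative shifts versus each model's own no-persona baseline.
example = {
    "model_x": {"persona A": 0.04, "persona B": -0.10},
    "model_y": {"persona A": -0.03, "persona B": -0.02},
}
divergent = divergent_personas(example)
print(divergent)
```

A sign flip alone says nothing about magnitude or noise, so in practice this check would be paired with a minimum effect size.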

Core Entities

Models

GPT-4o-mini, DeepSeek-V3, Qwen3-235B

Metrics

task success rate, win rate, final score, accuracy, query correctness, reward score

Datasets

ALFWorld, WebShop, Card Game (Liu et al. 2024), OS Interaction, Database (SQL tasks)

Benchmarks

ALFWorld, WebShop, Card Game, OS Interaction, Database

