Overview
Empirical evidence across three models and five benchmarks shows that persona prompts can change agent decisions; the results hold across the tested settings but do not cover all agents or environments.
Citations: 0
Evidence Strength: 0.72
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Persona prompts, even harmless-sounding ones, can change agent decisions and reduce task success, which creates safety, fairness, and reliability risks for production agents.
Who Should Care
Summary TLDR
This case study shows that giving LLM agents demographic personas (gender, race/origin, religion, profession) changes how they act and can degrade task performance. Across three models and five agent benchmarks, persona prompts produced consistent performance shifts: small changes (2–5%) on technical tasks but drops of up to 26.2% relative to baseline on strategic planning. The effect varies by persona, task type, and model, exposing a robustness and fairness risk when agents take actions in the world.
Problem Statement
Do persona prompts—short role-assignment prefixes that are irrelevant to the task—affect LLM agents' ability to perform multi-step, action-based tasks? The paper tests whether demographic personas change agent decisions and measurably degrade task outcomes across models and benchmarks.
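To make the setup concrete, the following is a minimal sketch of how a task-irrelevant persona prefix might be attached to an agent's instruction. The template wording, the `build_prompt` helper, and the example persona and task are all illustrative assumptions; the paper's exact prompt format is not given in this summary.

```python
from typing import Optional

# Hypothetical template: the exact persona wording used in the paper is not
# stated in this summary, so this string is an assumption for illustration.
PERSONA_TEMPLATE = "You are {persona}. "


def build_prompt(task_instruction: str, persona: Optional[str] = None) -> str:
    """Return the task prompt, optionally prefixed with a persona assignment."""
    if persona is None:
        # Baseline condition: the agent sees only the task instruction.
        return task_instruction
    # Persona condition: same task, with a role prefix unrelated to the task.
    return PERSONA_TEMPLATE.format(persona=persona) + task_instruction


# Illustrative task and persona (both made up for this example).
baseline_prompt = build_prompt("Put the mug on the shelf.")
persona_prompt = build_prompt("Put the mug on the shelf.", persona="a nurse")

print(baseline_prompt)
print(persona_prompt)
```

Because the two prompts differ only in the prefix, any gap in task success between the two conditions can be attributed to the persona rather than to the task description.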
Main Contribution
First systematic case study linking demographic persona prompts to performance changes in action-taking LLM agents.
Evaluation across 23 personas, 3 widely used models, and 5 agentic benchmarks showing consistent persona-induced volatility.
Key Findings
Personas can cause large performance drops on strategic tasks.
Racial and origin personas often drive the largest degradations.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 45.5% (DeepSeek V3, under some personas) | 61.7% | −26.2% relative (16.2 points absolute) | Card Game | Table 2; Persona Category Analysis | Table 2 |
| ALFWorld success rate | Up to 14% relative shift across personas and models | varies by model (example: 52.0% baseline shown) | ±14% | ALFWorld | Results; Impact on Agent Robustness | Table 2 |
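The Card Game delta is a relative change, not a difference in percentage points. The short sketch below recomputes both quantities from the rounded accuracies in the table; the helper names are assumptions made for illustration. Working from the rounded figures gives a relative drop of about 26.3%, which matches the reported −26.2% up to rounding of the published accuracies.

```python
def absolute_change(baseline: float, observed: float) -> float:
    """Difference in percentage points between observed and baseline."""
    return observed - baseline


def relative_change(baseline: float, observed: float) -> float:
    """Percent change of observed relative to baseline."""
    return (observed - baseline) / baseline * 100.0


# Rounded accuracies taken from the Card Game row of the results table.
baseline_acc = 61.7
persona_acc = 45.5

abs_delta = absolute_change(baseline_acc, persona_acc)
rel_delta = relative_change(baseline_acc, persona_acc)

print(f"absolute: {abs_delta:.1f} points, relative: {rel_delta:.1f}%")
```

Reporting both numbers avoids ambiguity: a 16.2-point drop and a roughly 26% relative drop describe the same result but read very differently out of context.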
What To Try In 7 Days
Audit deployed agent prompts: remove or neutralize demographic role prefixes.
Run a smoke test: evaluate agent performance with and without a small set of personas on critical tasks.
Add a persona-sensitivity check to CI: fail fast if role prompts change key metrics beyond a threshold.
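The smoke test and CI check above could be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation harness: `evaluate` stands in for whatever function runs your agent on a fixed task set and returns a success rate, the 5% threshold is an arbitrary placeholder, and the stub scores are made up so the example runs without a model.

```python
from typing import Callable, Dict, List, Optional

# Assumed signature: takes a persona (or None for baseline), returns success rate.
Evaluator = Callable[[Optional[str]], float]


def persona_deltas(evaluate: Evaluator, personas: List[str]) -> Dict[str, float]:
    """Relative change (%) in the metric for each persona versus no persona."""
    baseline = evaluate(None)
    if baseline <= 0:
        raise ValueError("baseline metric must be positive to compute relative change")
    return {p: (evaluate(p) - baseline) / baseline * 100.0 for p in personas}


def find_violations(
    evaluate: Evaluator, personas: List[str], max_rel_shift_pct: float = 5.0
) -> List[str]:
    """Personas whose relative shift exceeds the threshold in either direction."""
    deltas = persona_deltas(evaluate, personas)
    return [p for p, d in deltas.items() if abs(d) > max_rel_shift_pct]


def stub_evaluate(persona: Optional[str]) -> float:
    # Made-up scores standing in for a real agent evaluation run.
    scores = {None: 0.60, "persona_a": 0.59, "persona_b": 0.45}
    return scores[persona]


violations = find_violations(stub_evaluate, ["persona_a", "persona_b"])
print("personas exceeding threshold:", violations)
```

In a CI job, a non-empty `violations` list would typically translate into a non-zero exit code. The check uses the absolute value of the shift because the findings report that a persona can help one model and harm another, and either direction signals sensitivity to an irrelevant cue.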
Agent Features
Planning
Tool Use
Is Agentic: Yes
Architectures
Risks & Boundaries
Limitations
Evaluations cover three models and five benchmarks, not the full space of models or environments.
The persona set covers 23 roles and may miss other culturally specific identities.
When Not To Use
Do not generalize results to every LLM or multi-agent system without testing.
Avoid using these findings to claim universal harms for untested tasks or populations.
Failure Modes
Task-specific performance drops driven by irrelevant persona cues.
Cross-model divergence where the same persona helps one model but harms another.

