Assigning demographic personas to LLM agents can change decisions and cut task success by up to 26%

January 21, 2026 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Linbo Cao, Lihao Sun, Yang Yue

Why It Matters For Business

Persona prompts, even harmless-sounding ones, can change agent decisions and reduce task success, creating safety, fairness, and reliability risks for production agents.

Summary TLDR

This case study shows that giving LLM agents demographic personas (gender, race/origin, religion, profession) changes how they act and can degrade task performance. Across three models and five agent benchmarks, persona prompts produced consistent performance shifts: small changes (2–5%) on technical tasks but large drops up to 26.2% on strategic planning. The effect varies by persona, task type, and model, exposing a robustness and fairness risk when agents take actions in the world.

Problem Statement

Do persona prompts (short role-assignment prefixes that are irrelevant to the task) affect LLM agents' ability to perform multi-step, action-based tasks? The paper tests whether demographic personas change agent decisions and measurably degrade task outcomes across models and benchmarks.
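To make the setup concrete, here is a minimal sketch of how a task-irrelevant persona prefix might be attached to an agent's prompt. The template wording, the function name `build_system_prompt`, and the example persona and task strings are all assumptions for illustration; this summary does not give the paper's exact prompt format.

```python
from typing import Optional

# Hypothetical template: the paper's exact persona wording is not given here.
PERSONA_TEMPLATE = "You are {persona}."


def build_system_prompt(task_prompt: str, persona: Optional[str] = None) -> str:
    """Return the task prompt, optionally prefixed with a persona line."""
    if persona is None:
        return task_prompt  # baseline condition: no persona prefix
    return PERSONA_TEMPLATE.format(persona=persona) + "\n\n" + task_prompt


# Illustrative strings only; neither is taken from the benchmarks.
task = "Complete the household task using the available actions."
baseline = build_system_prompt(task)
with_persona = build_system_prompt(task, "a nurse")
```

Keeping the task text identical across conditions means any change in outcome can be attributed to the prefix rather than to the task instructions.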

Main Contribution

First systematic case study linking demographic persona prompts to performance changes in action-taking LLM agents.

Evaluation across 23 personas, 3 widely used models, and 5 agentic benchmarks showing consistent persona-induced volatility.

Quantified vulnerability: technical tasks vary little (2–5%), while planning/reasoning tasks can drop up to 26.2%.

Key Findings

Personas can cause large performance drops on strategic tasks.

Numbers: Card Game drop up to 26.2% (DeepSeek V3, 'from Africa')

Racial and origin personas often drive the largest degradations.

Numbers: Multiple models show ≥11% drops; GPT-4o-mini up to 19% under racial cues

Technical tasks (OS commands, SQL) are relatively stable.

Numbers: OS/Database usually fluctuate only 2–5% from baseline

Gender and profession personas shift behavior in task-dependent ways.

Numbers: ALFWorld: Male 88% vs Non-Binary 112% of baseline (GPT-4o-mini)

Different models react differently to the same persona.

Numbers: DeepSeek V3 drops from 71.2% to 48.5% under the Christian persona; GPT-4o-mini shows the opposite trend

Results

Accuracy

Value: DeepSeek V3 drops from 61.7% to 45.5% under some personas

Baseline: 61.7%
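A drop like this can be read either as an absolute change in percentage points or as a change relative to the baseline, and the two give quite different numbers. The sketch below works through both for the 61.7% and 45.5% figures above. The helper names are illustrative, and this summary does not say which convention each reported percentage uses, so the result should not be assumed to match any specific headline figure.

```python
def absolute_drop(baseline: float, observed: float) -> float:
    """Drop in percentage points."""
    return baseline - observed


def relative_drop(baseline: float, observed: float) -> float:
    """Drop as a percentage of the baseline score."""
    return (baseline - observed) / baseline * 100.0


baseline, observed = 61.7, 45.5
abs_pts = absolute_drop(baseline, observed)
rel_pct = relative_drop(baseline, observed)
print(f"{abs_pts:.1f} points, {rel_pct:.1f}% relative")
# prints "16.2 points, 26.3% relative"
```

When comparing persona effects across tasks with different baselines, stating which of the two measures is used avoids overstating or understating the effect.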

ALFWorld success rate

Value: Up to 14% relative shift across personas and models

Baseline: varies by model (example: 52.0% baseline shown)

Accuracy

Value: Typically stable within 2–5% from baseline

Baseline: task-dependent

Model-specific race persona effect

Value: GPT-4o-mini decreases by up to 19% under racial personas on some tasks

Baseline: model baseline per task

Who Should Care

What To Try In 7 Days

Audit deployed agent prompts: remove or neutralize demographic role prefixes.

Run a smoke test: evaluate agent performance with and without a small set of personas on critical tasks.

Add a persona-sensitivity check to CI: fail fast if role prompts change key metrics beyond a threshold.
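The smoke test and CI check above could be sketched roughly as follows. Everything here is an assumption for illustration: the `EvalFn` signature, the 5% threshold, the persona labels, and the stub scores are made up, and a real check would call your own benchmark harness and average over several runs to separate persona effects from ordinary run-to-run noise.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: takes a persona (None = baseline) and returns
# a success rate in [0, 1] from your own evaluation harness.
EvalFn = Callable[[Optional[str]], float]


def persona_sensitivity(evaluate: EvalFn, personas: List[str]) -> Dict[str, float]:
    """Relative change vs. baseline per persona (-0.10 means 10% worse)."""
    baseline = evaluate(None)
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return {p: (evaluate(p) - baseline) / baseline for p in personas}


def check_threshold(changes: Dict[str, float], max_rel_change: float = 0.05) -> List[str]:
    """Personas whose relative change exceeds the threshold in either direction."""
    return sorted(p for p, c in changes.items() if abs(c) > max_rel_change)


# Stub evaluator with made-up scores, standing in for a real benchmark run.
_FAKE_SCORES = {None: 0.80, "persona_a": 0.79, "persona_b": 0.64}


def fake_evaluate(persona: Optional[str]) -> float:
    return _FAKE_SCORES[persona]


changes = persona_sensitivity(fake_evaluate, ["persona_a", "persona_b"])
violations = check_threshold(changes)
if violations:
    # In CI, raise or exit non-zero here instead of printing.
    print("persona-sensitive:", violations)
```

The check flags shifts in both directions, since the findings above include personas that raise scores as well as ones that lower them; either case means an irrelevant prefix is changing agent behavior.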

Agent Features

Planning

  • multi-step planning
  • strategic reasoning

Tool Use

  • OS command execution
  • SQL generation
  • web interaction (e-commerce)

Is Agentic

true

Architectures

  • LLM-based agent

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations cover three models and five benchmarks, not the full space of models or environments.
  • The persona set covers 23 roles but may miss other culturally specific identities.
  • No interpretability analysis to pinpoint internal mechanisms behind the bias.

When Not To Use

  • Do not generalize results to every LLM or multi-agent system without testing.
  • Avoid using these findings to claim universal harms for untested tasks or populations.

Failure Modes

  • Task-specific performance drops driven by irrelevant persona cues.
  • Cross-model divergence where the same persona helps one model but harms another.
  • Stereotype-like signals entangled with competence, causing spurious correlations.

Core Entities

Models

  • GPT-4o-mini
  • DeepSeek-V3
  • Qwen3-235B

Metrics

  • task success rate
  • win rate
  • final score
  • accuracy
  • query correctness
  • reward score

Datasets

  • ALFWorld
  • WebShop
  • Card Game (Liu et al. 2024)
  • OS Interaction
  • Database (SQL tasks)

Benchmarks

  • ALFWorld
  • WebShop
  • Card Game
  • OS Interaction
  • Database