Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
24
Why It Matters For Business
Prompted LLMs can reliably take on personality-like profiles and produce believable narratives. This matters for products that personalize voice, simulate human users, or moderate content, and it suggests that AI-authorship disclosure policies are needed.
Summary TLDR
The authors create 320 prompt-driven "personas" for GPT-3.5 and GPT-4, covering the 32 binary (high/low) combinations of the Big Five traits. Personas answer the standard Big Five Inventory (BFI) in line with their assigned traits, with large effect sizes. Stories generated by GPT-4 personas show LIWC linguistic markers that overlap with those in human essays and are rated as readable, cohesive, and believable. Humans can infer some traits from single stories (Extraversion is best, with majority-vote accuracy 0.84), but accuracy drops when annotators know the text is from an AI. GPT-4 used as an automatic judge detects Extraversion strongly (accuracy 0.97). Code and data are publicly linked.
Problem Statement
Can large language models be reliably prompted to express specific personality traits? And if so, do those traits show up in language patterns and human perception?
Main Contribution
Systematic simulation of 320 LLM personas (covering the 32 binary Big Five trait combinations) for GPT-3.5 and GPT-4, with each persona completing the 44-item BFI.
LIWC-based psycholinguistic analysis linking LLM persona prompts to measurable linguistic markers and comparison with a human essays corpus.
Human and LLM evaluations of generated stories on quality and personality perception, including a condition where annotators know the text is AI-written.
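As a rough illustration of the persona setup described above, the sketch below enumerates the 32 high/low trait assignments and builds a system prompt for each. The prompt wording and the function names are assumptions made for this example, not the authors' actual prompts. Spreading 320 personas evenly over 32 combinations gives 10 per combination, which is also assumed here.

```python
from itertools import product

TRAITS = [
    "Extraversion",
    "Agreeableness",
    "Conscientiousness",
    "Neuroticism",
    "Openness",
]


def trait_combinations():
    """Return all 2**5 = 32 high/low assignments over the five traits."""
    levels = ("high", "low")
    return [dict(zip(TRAITS, combo)) for combo in product(levels, repeat=len(TRAITS))]


def persona_prompt(profile):
    """Build a hypothetical system prompt; the wording is illustrative only."""
    parts = [f"{level} in {trait}" for trait, level in profile.items()]
    return "You are a character who is " + ", ".join(parts) + "."


combos = trait_combinations()
# Assumption: 10 repetitions per combination, giving 320 personas in total.
personas = [(profile, rep) for profile in combos for rep in range(10)]

print(len(combos), len(personas))
print(persona_prompt(combos[0]))
```

Each generated prompt would then be sent as the system message before the questionnaire or story-writing request.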
Key Findings
LLM personas' self-reported BFI scores match their prompted traits with very large effects.
Stories from GPT-4 personas show linguistic markers that overlap with human-written essays.
Human annotators can perceive some traits from single stories; Extraversion is most recoverable.
Knowing a story is AI-written reduces human perception accuracy and lowers perceived personalness.
GPT-4 used as an automatic rater strongly recognizes Extraversion and modestly detects other traits.
Generated stories are rated highly for readability, cohesiveness, and believability by both humans and LLM evaluators.
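The perception results above are reported as majority-vote accuracy. A minimal sketch of that calculation follows, assuming each story receives an odd number of binary high/low judgments for one trait (five per story, matching the annotation setup reported under Limitations). The vote data are invented toy values, not results from the paper.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; an odd number of binary votes cannot tie."""
    return Counter(labels).most_common(1)[0][0]


def majority_vote_accuracy(votes_per_story, gold):
    """Fraction of stories whose majority label matches the prompted trait level."""
    hits = sum(majority_vote(v) == g for v, g in zip(votes_per_story, gold))
    return hits / len(gold)


# Toy data: five annotator judgments per story for a single trait.
votes = [
    ["high", "high", "low", "high", "high"],
    ["low", "low", "high", "low", "low"],
    ["high", "low", "low", "high", "low"],
    ["low", "high", "high", "low", "high"],
]
gold = ["high", "low", "high", "low"]

acc = majority_vote_accuracy(votes, gold)
print(acc)
```

Running the same function separately on judgments collected with and without AI-authorship disclosure would give the two accuracies to compare.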
Results
BFI effect sizes (GPT-4)
LIWC overlap with human Essays (GPT-4)
Accuracy
Accuracy
Accuracy
Human ratings on quality (means)
Who Should Care
What To Try In 7 Days
Prompt your LLM with a single-trait system prompt and run the BFI-44 questionnaire to check whether its answers align with the assigned trait.
Extract LIWC (or similar lexicon) features from model outputs to monitor intended persona signals.
Run a small blinded human study: compare perception with and without AI-authorship disclosure.
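For the first step above, questionnaire answers need to be turned into per-trait scores. The sketch below shows the usual approach for a 1 to 5 Likert instrument: reverse-keyed items are flipped (6 minus the rating) and the items of each scale are averaged. The item key shown is a made-up three-item toy, not the real BFI-44 key. Substitute the published key before using it.

```python
from statistics import mean

SCALE_MAX = 5  # ratings run from 1 (disagree strongly) to 5 (agree strongly)


def score_scale(responses, items):
    """Average one scale.

    responses: dict mapping item number -> rating in 1..SCALE_MAX
    items: list of (item_number, reverse_keyed) pairs for this scale
    """
    values = []
    for number, reverse_keyed in items:
        rating = responses[number]
        # Reverse-keyed items are flipped so that a higher value always
        # means more of the trait.
        values.append(SCALE_MAX + 1 - rating if reverse_keyed else rating)
    return mean(values)


# Illustrative key only: NOT the real BFI-44 item assignment.
TOY_KEY = {"Extraversion": [(1, False), (2, True), (3, False)]}

# Hypothetical answers parsed from a model's questionnaire output.
responses = {1: 5, 2: 1, 3: 4}

scores = {trait: score_scale(responses, items) for trait, items in TOY_KEY.items()}
print(scores)
```

A persona prompted as high in a trait should score near the top of that scale, and one prompted as low should score near the bottom.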
Reproducibility
Data URLs
- https://github.com/hjian42/PersonaLLM
- Essays dataset (Pennebaker & King 1999)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Study focuses primarily on closed models (GPT-3.5, GPT-4); LLaMA-2 results were weaker.
- Human evaluation sample is modest: 32 stories evaluated with 5 annotators per story.
- Tasks limited to single-shot story writing in English, not multi-turn dialogue or other languages.
- Low inter-annotator agreement on subjective story ratings; personality perception is noisy at the individual level.
When Not To Use
- Do not assume persona behavior in interactive, multi-turn settings without re-testing.
- Avoid deploying persona conditioning for sensitive decisions (hiring, clinical advice) based only on story-based tests.
- Do not generalize findings to other languages or non-narrative tasks without validation.
Failure Modes
- Models may write explicit trait words into their stories when the persona prompt leaks into the output, which contaminates lexicon-based measures (lexicon contamination).
- Smaller or less-aligned models (e.g., LLaMA-2 here) produced repetitive or uncooperative stories.
- Human perception of persona drops when people know content is AI-generated, biasing downstream judgments.
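One way to watch for the lexicon contamination failure mode listed above is to flag stories that name traits outright. The sketch below is a simple word-list check. The word list is a small hypothetical example, and this is not LIWC, which is a separate licensed lexicon.

```python
import re

# Hypothetical list of explicit trait words; extend it for real use.
TRAIT_WORDS = {
    "extraverted", "extroverted", "introverted", "agreeable",
    "disagreeable", "conscientious", "neurotic", "extraversion",
    "agreeableness", "conscientiousness", "neuroticism", "openness",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def leaked_words(text, lexicon=TRAIT_WORDS):
    """Return the sorted set of explicit trait words found in the text."""
    return sorted({tok for tok in tokenize(text) if tok in lexicon})


def trait_word_rate(text, lexicon=TRAIT_WORDS):
    """Share of tokens that are explicit trait words (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


story = "As an extraverted and agreeable person, I love parties."
print(leaked_words(story), trait_word_rate(story))
```

Stories with a non-zero rate can be excluded or reviewed before running lexicon-based analyses or perception studies.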
Core Entities
Models
- GPT-3.5-turbo-0613
- GPT-4-0613
- LLaMA-2
Metrics
- Cohen's d
- Accuracy
- Spearman r
- LIWC feature overlap counts
- Mean Likert scores
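Of the metrics listed above, Cohen's d is the one used for the BFI effect sizes. A minimal sketch of the standard pooled-standard-deviation form follows, comparing scale scores of personas prompted high versus low on a trait. The numbers are toy values, and whether the authors used exactly this variant is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy BFI scale scores for personas prompted high vs. low on one trait.
high = [4, 5, 4, 5]
low = [1, 2, 1, 2]

d = cohens_d(high, low)
print(round(d, 3))
```

By the common rule of thumb, values of d above about 0.8 are considered large.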
Datasets
- Essays (Pennebaker & King 1999)
- Generated persona stories (PersonaLLM dataset)
Benchmarks
- Big Five Inventory (BFI-44)

