Do LLMs write like a personality? GPT-3.5 and GPT-4 can be prompted to express Big Five traits and people often recognize them

Overview

Decision Snapshot: Needs Validation

The study gives solid within-experiment evidence (large effect sizes and LIWC overlaps), but it covers only closed GPT models and a single storytelling task, so the results should be generalized with caution beyond similar setups.

Citations: 24

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

Why It Matters For Business

Prompted LLMs can reliably take on personality-like profiles and produce believable narratives. This matters for products built around voice personalization, human simulation, or content moderation, and it suggests that disclosure policies are needed.

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

The authors create 320 prompt-driven "personas" for GPT-3.5 and GPT-4 (binary Big Five trait combinations). Personas reliably answer a standard Big Five Inventory (BFI) in line with their assigned traits (large effect sizes). Stories generated by GPT-4 personas show LIWC linguistic markers that overlap with human essays and are rated as readable, cohesive, and believable. Humans can infer some traits from single stories (Extraversion best: majority-vote accuracy 0.84) but accuracy drops when annotators know the text is from an AI. GPT-4 as an automatic judge strongly detects Extraversion (accuracy 0.97). Code and data are publicly linked.

Problem Statement

Can large language models be reliably prompted to express specific personality traits? And if so, do those traits show up in language patterns and human perception?

Main Contribution

Systematic simulation of 320 LLM personas (binary Big Five combinations) for GPT-3.5 and GPT-4 and administration of the 44-item BFI.

LIWC-based psycholinguistic analysis linking LLM persona prompts to measurable linguistic markers and comparison with a human essays corpus.
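The persona grid described above can be sketched in a few lines. Five binary traits give 2^5 = 32 persona types; the sketch assumes 10 copies per type to reach 320, which is one arithmetic way to get that total and is not stated in this summary. The adjective pairs and the prompt wording are also illustrative assumptions, not the authors' confirmed template.

```python
from itertools import product

# Trait abbreviations follow this summary (EXT, AGR, CON, NEU, OPN).
# The (low, high) adjectives are assumptions chosen for illustration.
TRAITS = {
    "EXT": ("introverted", "extroverted"),
    "AGR": ("antagonistic", "agreeable"),
    "CON": ("unconscientious", "conscientious"),
    "NEU": ("emotionally stable", "neurotic"),
    "OPN": ("closed to experience", "open to experience"),
}


def persona_grid(copies_per_type=10):
    """Return one dict per persona: binary trait levels plus a sample prompt."""
    personas = []
    for levels in product((0, 1), repeat=len(TRAITS)):
        assignment = dict(zip(TRAITS, levels))
        adjectives = [TRAITS[name][level] for name, level in assignment.items()]
        # Hypothetical prompt wording; adapt to your own template.
        prompt = "You are a character who is " + ", ".join(adjectives) + "."
        for copy in range(copies_per_type):
            personas.append({"traits": assignment, "copy": copy, "prompt": prompt})
    return personas


grid = persona_grid()
print(len(grid))  # 32 types x 10 copies
print(grid[0]["prompt"])
```

Keeping the binary assignment next to each prompt makes it easy to group questionnaire answers or stories by trait level later.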

Key Findings

LLM personas' self-reported BFI scores match their prompted traits with very large effects.

Numbers: GPT-4 Cohen's d: EXT 5.47; AGR 4.22; CON 4.39; NEU 5.17; OPN 6.30 (p<.001)

Practical Use: Prompting an LLM with a trait label produces reliably different questionnaire answers; use prompts to shape persona-level BFI-style behavior but validate downstream tasks.

Evidence Ref: Section 3.1, Fig.2
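For readers who want to reproduce this kind of effect size on their own persona runs, here is a minimal sketch of Cohen's d between a high-trait and a low-trait group. It assumes the pooled-standard-deviation variant; the summary does not say which variant the authors used. The scores are made up for illustration and are not data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(high, low):
    """Cohen's d using the pooled sample standard deviation of two groups."""
    n_high, n_low = len(high), len(low)
    pooled_var = (
        (n_high - 1) * variance(high) + (n_low - 1) * variance(low)
    ) / (n_high + n_low - 2)
    return (mean(high) - mean(low)) / sqrt(pooled_var)


# Made-up Extraversion scale scores (1-5) for two small persona groups.
high_ext = [4, 5, 4, 5]
low_ext = [2, 3, 2, 3]
print(round(cohens_d(high_ext, low_ext), 2))  # about 3.46
```

Values of d above roughly 0.8 are conventionally called large, so the reported figures of 4 to 6 indicate almost no overlap between the high and low groups.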

Stories from GPT-4 personas show linguistic markers that overlap with human-written essays.

Numbers: GPT-4 overlap with human LIWC features: Conscientiousness 11/31, Openness 17/36 (Table 1)

Practical Use: LLM-produced text contains recognizable trait-linked word patterns; you can use lexicon features (e.g., LIWC) to monitor persona expressivity.

Evidence Ref: Section 3.2, Table 1
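A lexicon-based monitor of this kind can be approximated without LIWC itself, which is a licensed dictionary. The sketch below uses a tiny hypothetical lexicon (the category names and word lists are invented for illustration and are not LIWC categories) and reports each category as a percentage of tokens. Correlating those rates with the assigned trait levels would be the next step and is not shown.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT the LIWC dictionary.
LEXICON = {
    "social": {"friend", "friends", "party", "talk", "we"},
    "negemo": {"worried", "afraid", "sad", "nervous"},
}


def category_rates(text, lexicon=LEXICON):
    """Return the percentage of tokens that fall in each lexicon category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {category: 100.0 * counts[category] / total for category in lexicon}


story = "We went to the party and I was not nervous at all."
print(category_rates(story))
```

Simple word counting ignores negation ("not nervous" still counts as a negative-emotion word), which is a known limit of lexicon features in general.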

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BFI effect sizes (GPT-4)	EXT d=5.47; AGR d=4.22; CON d=4.39; NEU d=5.17; OPN d=6.30	—	—	320 persona BFI responses	Section 3.1, Fig.2	Fig.2
LIWC overlap with human Essays (GPT-4)	Conscientiousness 11/31; Openness 17/36 overlapping significant features	GPT-3.5: Conscientiousness 1/31; Openness 2/36	—	LIWC on GPT persona stories vs Essays dataset	Section 3.2, Table 1	Table 1

What To Try In 7 Days

Prompt your LLM with a single-trait system prompt and run the BFI-44 questionnaire to see alignment.

Extract LIWC (or similar lexicon) features from model outputs to monitor intended persona signals.

Run a small blinded human study: compare perception with and without AI-authorship disclosure.
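For the first step above, the model's questionnaire answers have to be turned into scale scores. A minimal sketch, assuming answers on a 1-5 Likert scale and reverse-keyed items flipped as 6 minus the raw answer. The Extraversion item numbers below follow the key as commonly published for the BFI-44; treat them as an assumption and check them against the official scoring instructions before relying on the result. The answers are made up.

```python
def score_scale(responses, items, reverse_items, scale_max=5):
    """Average one scale's items on a 1..scale_max range, flipping reverse-keyed items."""
    values = []
    for item in items:
        raw = responses[item]
        if not 1 <= raw <= scale_max:
            raise ValueError(f"item {item}: response {raw} is out of range")
        values.append(scale_max + 1 - raw if item in reverse_items else raw)
    return sum(values) / len(values)


# Assumed Extraversion key (verify against the official BFI-44 scoring sheet).
EXT_ITEMS = [1, 6, 11, 16, 21, 26, 31, 36]
EXT_REVERSE = {6, 21, 31}

# Made-up answers from a hypothetical "extroverted" persona.
answers = {1: 5, 6: 1, 11: 5, 16: 4, 21: 2, 26: 5, 31: 1, 36: 5}
print(score_scale(answers, EXT_ITEMS, EXT_REVERSE))  # 4.75
```

Comparing the scale score for personas prompted high versus low on a trait gives a quick alignment check before any human study.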

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hjian42/PersonaLLM

Data URLs

https://github.com/hjian42/PersonaLLM

Essays dataset (Pennebaker & King 1999)

Risks & Boundaries

Limitations

Study focuses primarily on closed models (GPT-3.5, GPT-4); LLaMA-2 results were weaker.

Human evaluation sample is modest: 32 stories evaluated with 5 annotators per story.
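With five annotators per story, a majority vote is always defined for a binary judgment because ties cannot occur. The sketch below shows how a majority-vote accuracy could be computed, assuming each annotator's rating has already been reduced to a high (1) or low (0) label for a trait. The votes and gold labels are made up; with only 32 stories, a single flipped story moves the accuracy by about three percentage points.

```python
from collections import Counter


def majority_vote_accuracy(annotations, gold):
    """annotations: one list of annotator labels per story; gold: prompted labels."""
    correct = 0
    for votes, truth in zip(annotations, gold):
        winner, _ = Counter(votes).most_common(1)[0]
        correct += winner == truth
    return correct / len(gold)


# Made-up labels for four stories, five annotators each (1 = high, 0 = low).
votes_per_story = [
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]
prompted = [1, 0, 1, 1]
print(majority_vote_accuracy(votes_per_story, prompted))  # 0.75
```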

When Not To Use

Do not assume persona behavior in interactive, multi-turn settings without re-testing.

Avoid deploying persona conditioning for sensitive decisions (hiring, clinical advice) based only on story-based tests.

Failure Modes

Models may write explicit trait words into their stories when wording from the persona prompt leaks into the output (lexicon contamination).

Smaller or less-aligned models (e.g., LLaMA-2 here) produced repetitive or uncooperative stories.
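Both failure modes can be screened for automatically before stories go to annotators. The sketch below is one possible check, not the authors' procedure: it flags explicit trait words from an illustrative list (extend it to match the adjectives in your own prompts) and uses the share of distinct bigrams as a rough repetitiveness signal. Any cutoff for that ratio would have to be chosen on your own data.

```python
import re

# Illustrative trait-word list; an assumption, not taken from the paper.
TRAIT_WORDS = {
    "extroverted", "introverted", "agreeable", "antagonistic",
    "conscientious", "unconscientious", "neurotic", "openness",
}


def leaked_trait_words(text, trait_words=TRAIT_WORDS):
    """Return the explicit trait words that appear in the text, sorted."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted(set(tokens) & trait_words)


def distinct_bigram_ratio(text):
    """Share of unique bigrams; low values suggest repetitive output."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 1.0


story = "As an extroverted person, I love parties."
print(leaked_trait_words(story))
print(distinct_bigram_ratio("go go go go"))
```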

Core Entities

Models

GPT-3.5-turbo-0613, GPT-4-0613, LLaMA-2

Metrics

Cohen's d, Accuracy, Spearman r, LIWC feature overlap counts, Mean Likert scores

Datasets

Essays (Pennebaker & King 1999), Generated persona stories (PersonaLLM dataset)

Benchmarks

Big Five Inventory (BFI-44)
