Do LLMs write like a personality? GPT-3.5 and GPT-4 can be prompted to express Big Five traits and people often recognize them

May 4, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The study gives solid within-experiment evidence (large effect sizes and LIWC overlaps), but it covers mainly closed GPT models and a single storytelling task, so its results should be generalized beyond similar setups only with caution.

Citations: 24

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Prompted LLMs can reliably take on personality-like profiles and produce believable narratives. This matters for products involving voice personalization, human simulation, or content moderation, and it suggests that disclosure policies are needed.

Who Should Care

Summary TLDR

The authors create 320 prompt-driven "personas" for GPT-3.5 and GPT-4 (binary Big Five trait combinations). Personas reliably answer a standard Big Five Inventory (BFI) in line with their assigned traits (large effect sizes). Stories generated by GPT-4 personas show LIWC linguistic markers that overlap with human essays and are rated as readable, cohesive, and believable. Humans can infer some traits from single stories (Extraversion best: majority-vote accuracy 0.84) but accuracy drops when annotators know the text is from an AI. GPT-4 as an automatic judge strongly detects Extraversion (accuracy 0.97). Code and data are publicly linked.

Problem Statement

Can large language models be reliably prompted to express specific personality traits? And if so, do those traits show up in language patterns and human perception?

Main Contribution

Systematic simulation of 320 LLM personas (binary Big Five combinations) for GPT-3.5 and GPT-4 and administration of the 44-item BFI.

LIWC-based psycholinguistic analysis linking LLM persona prompts to measurable linguistic markers and comparison with a human essays corpus.
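The persona grid described above can be sketched as follows. Five binary traits give 2^5 = 32 combinations; reaching 320 personas then implies 10 personas per combination, which is an inference from the totals rather than something this summary states. The adjective pairs in `TRAIT_WORDS` and the wording in `build_prompt` are illustrative assumptions, not the authors' exact prompts.

```python
from itertools import product

# Hypothetical adjective pairs (high, low) for each Big Five trait.
TRAIT_WORDS = {
    "EXT": ("extroverted", "introverted"),
    "AGR": ("agreeable", "antagonistic"),
    "CON": ("conscientious", "unconscientious"),
    "NEU": ("neurotic", "emotionally stable"),
    "OPN": ("open to experience", "closed to experience"),
}

def build_prompt(profile):
    """Turn a {trait: bool} profile into a persona prompt (illustrative wording)."""
    words = [TRAIT_WORDS[t][0 if high else 1] for t, high in profile.items()]
    return "You are a character who is " + ", ".join(words[:-1]) + ", and " + words[-1] + "."

def persona_grid(per_combination=10):
    """Enumerate every binary trait combination, repeated per_combination times."""
    traits = list(TRAIT_WORDS)
    grid = []
    for levels in product([True, False], repeat=len(traits)):
        profile = dict(zip(traits, levels))
        for replica in range(per_combination):
            grid.append({
                "profile": profile,
                "replica": replica,
                "prompt": build_prompt(profile),
            })
    return grid

personas = persona_grid()
print(len(personas))  # 32 combinations x 10 replicas each (assumed split)
print(personas[0]["prompt"])
```

Each entry's prompt would be sent as the system message before administering the BFI or requesting a story; the `replica` index only distinguishes repeated samples of the same profile.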

Key Findings

LLM personas' self-reported BFI scores match their prompted traits with very large effects.

Numbers: GPT-4 Cohen's d: EXT 5.47; AGR 4.22; CON 4.39; NEU 5.17; OPN 6.30 (p<.001)

Practical Use: Prompting an LLM with a trait label produces reliably different questionnaire answers; use prompts to shape persona-level BFI-style behavior but validate downstream tasks.

Evidence Ref: Section 3.1, Fig. 2
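For readers who want to reproduce this kind of effect size on their own persona runs, here is a minimal sketch of Cohen's d between a high-trait and a low-trait group. It assumes the pooled-standard-deviation form; this summary does not say which variant the authors used. The scores below are toy values, not data from the paper.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (sample variances, n - 1)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Toy BFI Extraversion scale scores on a 1-5 range (illustrative only).
high_ext = [4.5, 4.8, 4.6, 4.9, 4.4]
low_ext = [1.6, 1.9, 1.5, 2.0, 1.7]
print(round(cohens_d(high_ext, low_ext), 2))
```

Values of d above 4, as reported here, arise when the two groups' score distributions barely overlap, which is what tightly clustered, prompt-driven questionnaire answers tend to produce.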

Stories from GPT-4 personas show linguistic markers that overlap with human-written essays.

Numbers: GPT-4 overlap with human LIWC features: Conscientiousness 11/31, Openness 17/36 (Table 1)

Practical Use: LLM-produced text contains recognizable trait-linked word patterns; you can use lexicon features (e.g., LIWC) to monitor persona expressivity.

Evidence Ref: Section 3.2, Table 1
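A rough sketch of lexicon-based monitoring is shown below. LIWC itself is a licensed dictionary, so `TOY_LEXICON` is a small hypothetical stand-in, and `overlap_count` simply intersects two sets of feature names that some upstream significance test has already flagged; the statistical test itself is not shown and is not taken from the paper.

```python
import re
from collections import Counter

# Toy stand-in for a LIWC-style lexicon; real LIWC categories are licensed.
TOY_LEXICON = {
    "social": {"friend", "friends", "party", "talk", "together"},
    "negemo": {"worried", "afraid", "sad", "nervous"},
    "achieve": {"plan", "goal", "finish", "work"},
}

def category_rates(text, lexicon=TOY_LEXICON):
    """Share of tokens that fall in each lexicon category (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {category: counts[category] / total for category in lexicon}

def overlap_count(llm_significant, human_significant):
    """Number of features flagged as significant in both sources."""
    return len(set(llm_significant) & set(human_significant))

story = "We met our friends at the party and made a plan to talk together."
print(category_rates(story))
print(overlap_count({"social", "achieve"}, {"social", "negemo"}))
```

In practice the per-story rates would be correlated with the assigned trait level, and only features that pass a significance threshold in both the model stories and the human essays would count toward an overlap figure such as 11/31.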

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BFI effect sizes (GPT-4) | EXT d=5.47; AGR d=4.22; CON d=4.39; NEU d=5.17; OPN d=6.30 | - | - | 320 persona BFI responses | Section 3.1, Fig. 2 | Fig. 2
LIWC overlap with human Essays (GPT-4) | Conscientiousness 11/31; Openness 17/36 overlapping significant features | GPT-3.5: Conscientiousness 1/31; Openness 2/36 | - | LIWC on GPT persona stories vs Essays dataset | Section 3.2, Table 1 | Table 1

What To Try In 7 Days

Prompt your LLM with a single-trait system prompt and run the BFI-44 questionnaire to see alignment.

Extract LIWC (or similar lexicon) features from model outputs to monitor intended persona signals.

Run a small blinded human study: compare perception with and without AI-authorship disclosure.
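For the blinded human study, a minimal sketch of majority-vote scoring is given below. It assumes binary "high"/"low" trait judgments and an odd number of annotators per story (the paper used 5), so ties cannot occur. The condition names `blind` and `disclosed` and all ratings are toy values for illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Most common label among annotators; an odd count avoids binary ties."""
    return Counter(labels).most_common(1)[0][0]

def majority_accuracy(annotations, gold):
    """Fraction of stories whose majority label matches the assigned trait level."""
    hits = sum(majority_vote(votes) == truth for votes, truth in zip(annotations, gold))
    return hits / len(gold)

# Toy ratings: 4 stories x 5 annotators judging Extraversion as "high" or "low".
gold = ["high", "low", "high", "low"]
blind = [
    ["high"] * 4 + ["low"],
    ["low"] * 5,
    ["high"] * 3 + ["low"] * 2,
    ["low"] * 3 + ["high"] * 2,
]
disclosed = [
    ["high"] * 3 + ["low"] * 2,
    ["low"] * 4 + ["high"],
    ["low"] * 3 + ["high"] * 2,
    ["low"] * 5,
]

print(majority_accuracy(blind, gold), majority_accuracy(disclosed, gold))
```

Comparing the two accuracies per trait gives a quick read on whether disclosing AI authorship changes how well annotators recover the intended persona.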

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/hjian42/PersonaLLM

Essays dataset (Pennebaker & King 1999)

Risks & Boundaries

Limitations

Study focuses primarily on closed models (GPT-3.5, GPT-4); LLaMA-2 results were weaker.

Human evaluation sample is modest: 32 stories evaluated with 5 annotators per story.

When Not To Use

Do not assume persona behavior in interactive, multi-turn settings without re-testing.

Avoid deploying persona conditioning for sensitive decisions (hiring, clinical advice) based only on story-based tests.

Failure Modes

Models may echo explicit trait words from the persona prompt in their output (lexicon contamination), which can distort lexicon-based trait signals.

Smaller or less-aligned models (e.g., LLaMA-2 here) produced repetitive or uncooperative stories.

Core Entities

Models

GPT-3.5-turbo-0613, GPT-4-0613, LLaMA-2

Metrics

Cohen's d, Accuracy, Spearman r, LIWC feature overlap counts, Mean Likert scores

Datasets

Essays (Pennebaker & King 1999), Generated persona stories (PersonaLLM dataset)

Benchmarks

Big Five Inventory (BFI-44)