Do LLMs write like a personality? GPT-3.5 and GPT-4 can be prompted to express Big Five traits and people often recognize them

May 4, 2023 | 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

24

Authors

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

Links

Abstract / PDF

Why It Matters For Business

Prompted LLMs can reliably take on personality-like profiles and produce believable narratives. This matters for products that personalize voice, simulate human respondents, or moderate content, and it suggests that disclosure policies are needed.

Summary TLDR

The authors create 320 prompt-driven "personas" for GPT-3.5 and GPT-4, covering the 32 binary combinations of the Big Five traits. Personas answer a standard Big Five Inventory (BFI) in line with their assigned traits, with large effect sizes. Stories generated by GPT-4 personas show LIWC linguistic markers that overlap with those found in human essays, and the stories are rated as readable, cohesive, and believable. Humans can infer some traits from a single story (Extraversion is best, with majority-vote accuracy 0.84), but accuracy drops when annotators know the text was written by an AI. GPT-4 used as an automatic judge detects Extraversion strongly (accuracy 0.97). Code and data are publicly linked.

Problem Statement

Can large language models be reliably prompted to express specific personality traits? And if so, do those traits show up in language patterns and human perception?

Main Contribution

Systematic simulation of 320 LLM personas (binary Big Five combinations) for GPT-3.5 and GPT-4 and administration of the 44-item BFI.

LIWC-based psycholinguistic analysis linking LLM persona prompts to measurable linguistic markers and comparison with a human essays corpus.

Human and LLM evaluations of generated stories on quality and personality perception, including a condition where annotators know the text is AI-written.
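The persona grid behind the first contribution can be sketched in a few lines. This is only an illustration: the adjective pairs and the prompt template are assumptions rather than the paper's exact wording, and the repeat count of 10 is inferred from 320 personas divided by 32 trait combinations.

```python
from itertools import product

TRAITS = ["Extraversion", "Agreeableness", "Conscientiousness",
          "Neuroticism", "Openness"]

# Assumed (low, high) adjective pairs; not taken verbatim from the paper.
POLES = {
    "Extraversion": ("introverted", "extroverted"),
    "Agreeableness": ("antagonistic", "agreeable"),
    "Conscientiousness": ("unconscientious", "conscientious"),
    "Neuroticism": ("emotionally stable", "neurotic"),
    "Openness": ("closed to experience", "open to experience"),
}


def persona_profiles():
    """Yield every binary assignment of the five traits as {trait: 0 or 1}."""
    for bits in product((0, 1), repeat=len(TRAITS)):
        yield dict(zip(TRAITS, bits))


def persona_prompt(profile):
    """Build an illustrative system prompt from a profile (assumed template)."""
    adjectives = [POLES[t][profile[t]] for t in TRAITS]
    return "You are a character who is " + ", ".join(adjectives) + "."


profiles = list(persona_profiles())
REPEATS = 10  # assumption: 320 personas / 32 profiles
personas = [(run, p) for p in profiles for run in range(REPEATS)]

print(len(profiles), len(personas))  # prints: 32 320
print(persona_prompt(profiles[-1]))
```

Each `(run, profile)` pair would then be sent to the model as a separate persona; the API call itself is left out because it depends on the provider and client library.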

Key Findings

LLM personas' self-reported BFI scores match their prompted traits with very large effects.

Numbers: GPT-4 Cohen's d: EXT 5.47; AGR 4.22; CON 4.39; NEU 5.17; OPN 6.30 (p < .001)

Stories from GPT-4 personas show linguistic markers that overlap with human-written essays.

Numbers: GPT-4 overlap with human LIWC features: Conscientiousness 11/31, Openness 17/36 (Table 1)

Human annotators can perceive some traits from single stories; Extraversion is most recoverable.

Numbers: Individual human accuracy EXT 0.68; majority-vote accuracy EXT 0.84 and AGR 0.69 (Fig. 3–4)

Knowing a story is AI-written reduces human perception accuracy and lowers perceived personalness.

Numbers: Spearman r between human-rated Extraversion and BFI scores drops from r = .64 (uninformed) to r = .42 (informed) (p < .001)

GPT-4 used as an automatic rater strongly recognizes Extraversion and modestly detects other traits.

Numbers: GPT-4 rater accuracy: EXT 0.97; AGR 0.68; CON 0.69 (Section 3.4.1)

Generated stories are rated highly for readability, cohesiveness, and believability by both humans and LLM evaluators.

Numbers: Human mean readability 4.28/5, cohesiveness 4.23/5, personalness 4.32/5 (Table 2)
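The gap between individual accuracy (0.68) and majority-vote accuracy (0.84) for Extraversion comes from aggregating several noisy raters per story. The sketch below shows the two computations on made-up toy ratings; the data and function names are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def majority_vote(labels):
    """Most common binary label; no ties with an odd number of raters."""
    return Counter(labels).most_common(1)[0][0]


def individual_accuracy(ratings, truth):
    """Share of single annotator judgments that match the assigned trait."""
    pairs = [(r, t) for story, t in zip(ratings, truth) for r in story]
    return sum(r == t for r, t in pairs) / len(pairs)


def majority_accuracy(ratings, truth):
    """Share of stories whose majority label matches the assigned trait."""
    votes = [majority_vote(story) for story in ratings]
    return sum(v == t for v, t in zip(votes, truth)) / len(truth)


# Toy data (invented): 4 stories, 5 annotators each, 1 = high Extraversion.
ratings = [
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 0],
]
truth = [1, 0, 1, 1]

print(individual_accuracy(ratings, truth))  # prints: 0.65
print(majority_accuracy(ratings, truth))    # prints: 0.75
```

With five annotators per story, as in the human evaluation described here, a majority always exists for a binary label, so no tie-breaking rule is needed.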

Results

BFI effect sizes (GPT-4)

Value: EXT d = 5.47; AGR d = 4.22; CON d = 4.39; NEU d = 5.17; OPN d = 6.30

LIWC overlap with human Essays (GPT-4)

Value: Conscientiousness 11/31; Openness 17/36 overlapping significant features

Baseline: GPT-3.5: Conscientiousness 1/31; Openness 2/36

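Accuracy (individual human annotators, Extraversion)

Value: 0.68

Baseline: Random 0.50

Accuracy (human majority vote, Extraversion)

Value: 0.84

Baseline: Random 0.50

Accuracy (GPT-4 as rater, Extraversion)

Value: 0.97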

Human ratings on quality (means)

Value: Readability 4.28; Cohesiveness 4.23; Personalness 4.32 (out of 5)
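The BFI effect sizes above are Cohen's d values, which express the difference between two group means in units of their pooled standard deviation. A minimal sketch follows, assuming the comparison is between personas prompted high and personas prompted low on a trait. The scores are invented for illustration, and the pooled-variance form used here is the common textbook definition, which may differ in detail from the authors' computation.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(high, low):
    """Cohen's d with a pooled sample standard deviation."""
    n1, n2 = len(high), len(low)
    pooled_var = ((n1 - 1) * variance(high) + (n2 - 1) * variance(low)) / (n1 + n2 - 2)
    return (mean(high) - mean(low)) / sqrt(pooled_var)


# Invented BFI Extraversion scores (1-5 scale) for two persona groups.
prompted_high = [4.5, 4.75, 4.25, 4.5]
prompted_low = [1.5, 1.75, 1.25, 1.5]

d = cohens_d(prompted_high, prompted_low)
print(round(d, 2))
```

Values of d above roughly 0.8 are conventionally called large, so values of 4 to 6 indicate that the high and low groups barely overlap in their questionnaire scores.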

Who Should Care

What To Try In 7 Days

Prompt your LLM with a single-trait system prompt and run the BFI-44 questionnaire to see alignment.

Extract LIWC (or similar lexicon) features from model outputs to monitor intended persona signals.

Run a small blinded human study: compare perception with and without AI-authorship disclosure.
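For the first step above, the model's questionnaire answers have to be converted into trait scores. The sketch below assumes the answers have already been parsed into a mapping from item number (1 to 44) to a 1 to 5 Likert rating. The item assignments and reverse-keyed items follow the commonly published BFI-44 scoring key, but they should be checked against the copy of the instrument actually used.

```python
# Commonly published BFI-44 scoring key (verify against your instrument).
BFI_ITEMS = {
    "EXT": [1, 6, 11, 16, 21, 26, 31, 36],
    "AGR": [2, 7, 12, 17, 22, 27, 32, 37, 42],
    "CON": [3, 8, 13, 18, 23, 28, 33, 38, 43],
    "NEU": [4, 9, 14, 19, 24, 29, 34, 39],
    "OPN": [5, 10, 15, 20, 25, 30, 35, 40, 41, 44],
}
# Reverse-keyed items: a rating r is scored as 6 - r.
REVERSED = {6, 21, 31, 2, 12, 27, 37, 8, 18, 23, 43, 9, 24, 34, 35, 41}


def score_bfi(responses):
    """Return the mean item score per trait from {item_number: rating 1..5}."""
    scores = {}
    for trait, items in BFI_ITEMS.items():
        values = [6 - responses[i] if i in REVERSED else responses[i]
                  for i in items]
        scores[trait] = sum(values) / len(values)
    return scores


# Hypothetical parsed answers: the persona answered 5 to every item.
answers = {item: 5 for item in range(1, 45)}
print(score_bfi(answers))
```

A persona prompted as extroverted would be expected to score well above 3 on EXT. Answering 5 to every item, as in the toy input, does not produce that pattern because the reverse-keyed items pull the mean back toward the midpoint.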

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Study focuses primarily on closed models (GPT-3.5, GPT-4); LLaMA-2 results were weaker.
  • Human evaluation sample is modest: 32 stories evaluated with 5 annotators per story.
  • Tasks limited to single-shot story writing in English, not multi-turn dialogue or other languages.
  • Low inter-annotator agreement on subjective story ratings; personality perception is noisy at individual level.

When Not To Use

  • Do not assume persona behavior in interactive, multi-turn settings without re-testing.
  • Avoid deploying persona conditioning for sensitive decisions (hiring, clinical advice) based only on story-based tests.
  • Do not generalize findings to other languages or non-narrative tasks without validation.

Failure Modes

  • Models may leak explicit trait words from the persona prompt into their stories, contaminating lexicon-based measures (lexicon contamination).
  • Smaller or less-aligned models (e.g., LLaMA-2 here) produced repetitive or uncooperative stories.
  • Human perception of persona drops when people know content is AI-generated, biasing downstream judgments.

Core Entities

Models

  • GPT-3.5-turbo-0613
  • GPT-4-0613
  • LLaMA-2

Metrics

  • Cohen's d
  • Accuracy
  • Spearman r
  • LIWC feature overlap counts
  • Mean Likert scores
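Of the metrics listed, Spearman r is the one used to relate perceived trait ratings to BFI scores. It is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a dependency-free illustration on invented numbers, not the authors' analysis code.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: human Extraversion ratings vs. persona BFI scores.
human_ratings = [4, 2, 5, 1, 3, 4]
bfi_scores = [4.1, 2.5, 4.8, 1.2, 2.0, 3.9]
print(round(spearman(human_ratings, bfi_scores), 3))
```

In practice a library routine such as `scipy.stats.spearmanr` would also report a p-value, which this sketch omits.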

Datasets

  • Essays (Pennebaker & King 1999)
  • Generated persona stories (PersonaLLM dataset)

Benchmarks

  • Big Five Inventory (BFI-44)