A framework and automatic metric to detect when LLM 'simulations' turn people into caricatures

October 17, 2023 · 7 min

Overview

Production Readiness

0.45

Novelty Score

0.65

Cost Impact Score

0.3

Citation Count

2

Authors

Myra Cheng, Tiziano Piccardi, Diyi Yang

Links

Abstract / PDF

Why It Matters For Business

If you use LLMs to simulate user groups, broad prompts can produce stereotype-like outputs that mislead product decisions; test simulations with this metric and favor specific prompts.

Summary TLDR

This paper introduces CoMPosT, a simple 4-part taxonomy (Context, Model, Persona, Topic) and an automatic two-step metric to detect 'caricature' in LLM simulations. Caricature is defined as outputs that (1) individuate a persona from a default and (2) exaggerate persona-defining language at the expense of topical content. Using GPT-4 across forum, interview, and Twitter prompts, the authors find caricature is higher for broad, low-specificity topics and for certain personas (nonbinary, some racial minorities, political groups). They release code and recommend using more specific topics and documenting positionality when simulating groups.

Problem Statement

LLM-based simulations of people lack a shared way to describe and measure when outputs collapse complex groups into flattened, stereotype-like narratives. Existing checks (replication or believability) miss open-ended exaggeration and can hide stereotyping.

Main Contribution

CoMPosT: a four-dimension taxonomy (Context, Model, Persona, Topic) to describe LLM simulations.

A paired, automatic metric for caricature: (A) individuation (can outputs be told apart from default-persona outputs?) and (B) exaggeration (do outputs emphasize persona-defining words over topic words, as measured along persona-topic semantic axes?).

An empirical study with GPT-4 across forum, interview, and Twitter-style prompts showing higher caricature for broad topics and certain personas, plus code and data.
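The individuation half of the metric can be read as a held-out classification task: if a classifier can separate persona outputs from default-persona outputs better than chance (0.5), the persona is individuated. The sketch below illustrates that idea only. It swaps in a simple nearest-centroid classifier and synthetic 3-dimensional vectors in place of real sentence embeddings; the function names, the 20% test split, and the toy data are all assumptions, not the authors' implementation.

```python
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def individuation_accuracy(persona_vecs, default_vecs, test_frac=0.2, seed=0):
    """Held-out accuracy of a nearest-centroid classifier (persona vs. default).

    Assumption: nearest-centroid stands in for whatever classifier is used
    in practice; 0.5 corresponds to random chance for balanced classes.
    """
    rng = random.Random(seed)
    data = [(v, 1) for v in persona_vecs] + [(v, 0) for v in default_vecs]
    rng.shuffle(data)
    n_test = max(1, int(len(data) * test_frac))
    test, train = data[:n_test], data[n_test:]
    c_persona = centroid([v for v, y in train if y == 1])
    c_default = centroid([v for v, y in train if y == 0])
    correct = 0
    for v, y in test:
        predicted = 1 if sq_dist(v, c_persona) < sq_dist(v, c_default) else 0
        correct += predicted == y
    return correct / len(test)


# Synthetic stand-ins for embeddings of 50 persona and 50 default outputs.
_rng = random.Random(42)


def noisy(center, n):
    return [[c + _rng.gauss(0, 0.1) for c in center] for _ in range(n)]


persona_vecs = noisy([1.0, 1.0, 1.0], 50)
default_vecs = noisy([0.0, 0.0, 0.0], 50)
acc = individuation_accuracy(persona_vecs, default_vecs)
print(f"individuation accuracy: {acc:.2f}")
```

With real data, the vectors would come from an embedding model applied to the simulation outputs, and a score near 0.5 would suggest the persona prompt is not changing the output in a detectable way.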

Key Findings

Every tested persona was distinguishable from a default persona (individuation > random).

Numbers: mean individuation > 0.5 for every persona (95% CI)

General, low-specificity topics produce stronger caricature than specific topics.

Certain personas show higher exaggeration scores: nonbinary, Black, Hispanic, Middle-Eastern, and conservative personas tended to be caricatured more.

Results

Individuation (differentiability from default-persona)

Value: mean score > 0.5 for all personas; mean > 0.95 for every persona in the interview context

Baseline: random chance = 0.5

Exaggeration (normalized similarity to persona-topic axis)

Value: higher for general/uncontroversial topics; lower for specific topics

Baseline: default-topic / default-persona scaling

Persona sensitivity (which personas are caricatured most)

Value: nonbinary, Black, Hispanic, Middle-Eastern, and conservative personas show the highest mean exaggeration

Baseline: other personas (man, woman, white, Asian are often lower)

Twitter context individuation and exaggeration

Value: individuation of 0.94 for Democrat and 0.88 for Republican; Republican personas showed higher mean exaggeration

Baseline: N/A

Sample size and power

Value: 100 outputs per simulation

Baseline: power analysis required 28–41 samples

What To Try In 7 Days

Run the paper's individuation+exaggeration test on your simulation prompts using 50–100 samples each.

Replace broad prompts with concrete, task-specific prompts and compare exaggeration scores.

Document persona/topic/context choices (CoMPosT) and log any high-exaggeration persona-topic combos for review.
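One lightweight way to follow the steps above is to store each simulation's four CoMPosT dimensions next to its scores and flag high-exaggeration combinations for review. The sketch below is a minimal illustration: the `SimulationRecord` class, the `flag_for_review` helper, the 0.6 threshold, and every value in the example records are hypothetical placeholders, not figures from the paper.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SimulationRecord:
    """One simulation described along the CoMPosT dimensions, plus its scores."""
    context: str
    model: str
    persona: str
    topic: str
    individuation: float
    exaggeration: float


def flag_for_review(records, threshold):
    """Return the records whose exaggeration score meets or exceeds the threshold."""
    return [r for r in records if r.exaggeration >= threshold]


# Placeholder records with made-up scores, for illustration only.
records = [
    SimulationRecord("online forum", "gpt-4", "persona A", "broad topic", 0.97, 0.81),
    SimulationRecord("online forum", "gpt-4", "persona A", "specific topic", 0.90, 0.32),
]

REVIEW_THRESHOLD = 0.6  # hypothetical cut-off; choose one that fits your own data

flagged = flag_for_review(records, REVIEW_THRESHOLD)
for record in flagged:
    # One JSON object per line is easy to append to a log file and diff later.
    print(json.dumps(asdict(record)))
```

Keeping the broad-prompt and specific-prompt runs in the same log makes the before/after comparison of exaggeration scores straightforward.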

Reproducibility

Data Urls

  • supplementary (topic/persona lists referenced in appendices)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Measures only one failure mode (caricature) and may miss other stereotypes or harms.
  • Relies on embedding model and seed-word selection; scores can shift with different encoders.
  • Experiments are single-round prompts and a limited persona set; multi-turn and other demographics need extra work.

When Not To Use

  • When you need a full bias audit covering multiple harm types; this is a targeted check.
  • For multi-round agent simulations without adapting the metric to full conversational state.
  • When simulation outputs are intentionally generic or aggregate summaries rather than persona-specific.

Failure Modes

  • False positives: high exaggeration score for acceptable persona influence.
  • Implicit default masking: some low-caricature personas may reflect model defaults rather than fair representation.
  • Dependence on memorized content or prompt phrasing can confound the axes.

Core Entities

Models

  • GPT-4
  • Sentence-BERT all-mpnet-base-v2 (for embeddings)

Metrics

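  • individuation (accuracy at distinguishing persona outputs from default-persona outputs; random chance = 0.5)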
  • exaggeration (normalized cosine similarity to persona-topic semantic axis)
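The exaggeration metric can be sketched as a projection onto a semantic axis. The code below assumes the axis is the mean persona-pole embedding minus the mean topic-pole embedding, and that the raw cosine similarity is rescaled so the default-persona simulation maps to 0 and the default-topic simulation maps to 1. That rescaling is one plausible reading of the "default-topic / default-persona scaling" baseline; the paper's exact normalization and pole construction may differ. All function names and the 2-dimensional toy vectors are assumptions.

```python
import math


def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_axis(persona_pole, topic_pole):
    """Axis pointing from the topic pole toward the persona pole."""
    p, t = mean_vec(persona_pole), mean_vec(topic_pole)
    return [x - y for x, y in zip(p, t)]


def exaggeration(sim_vecs, default_persona_vecs, default_topic_vecs, axis):
    """Rescaled cosine similarity of the mean simulation embedding to the axis.

    Assumed scaling: 0 = default-persona simulation (topic only),
    1 = default-topic simulation (persona only).
    """
    raw = cosine(mean_vec(sim_vecs), axis)
    low = cosine(mean_vec(default_persona_vecs), axis)
    high = cosine(mean_vec(default_topic_vecs), axis)
    return (raw - low) / (high - low)


# Toy 2-D stand-ins: dimension 0 is "persona-like", dimension 1 is "topic-like".
axis = semantic_axis(persona_pole=[[1.0, 0.0]], topic_pole=[[0.0, 1.0]])
score = exaggeration(
    sim_vecs=[[1.0, 1.0]],              # output mixing persona and topic language
    default_persona_vecs=[[0.0, 1.0]],  # topic given, persona left unspecified
    default_topic_vecs=[[1.0, 0.0]],    # persona given, topic left unspecified
    axis=axis,
)
print(f"exaggeration: {score:.2f}")
```

In this toy setup an output that balances persona and topic language lands midway between the two defaults. In practice the vectors would be embeddings of real outputs, for example from the Sentence-BERT model listed above.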

Datasets

  • WikiHow topic set (30 topics sampled)
  • ProCon.org topic set (30 topics sampled)
  • Pew OpinionQA questions (30 questions sampled)

Context Entities

Models

  • GPT-4

Datasets

  • Online forum prompts (WikiHow + ProCon)
  • Interview prompts (Pew questions)
  • Twitter prompts (Jiang et al. setup)