PsychoBench: 13 psychometric scales to profile LLM personality, motivation, relationships, and emotions

October 2, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark applies established psychometric scales and standard statistical tests to model outputs across multiple LLMs and prompt settings. Results are consistent, but comparisons to human norms are limited by demographic differences and subjective scale design.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

PsychoBench gives a repeatable way to describe how an LLM will sound and react, so teams can tune persona, anticipate safety shifts from prompts or alignment changes, and audit models before deployment.

Who Should Care

Summary TLDR

The authors build PsychoBench, a practical benchmark that runs 13 standard psychometric scales (Big Five, Dark Triad, emotional intelligence, self-efficacy, vocational interests, etc.) against LLMs. They test five models (text-davinci-003, ChatGPT/gpt-3.5-turbo, GPT-4, LLaMA-2-7B, LLaMA-2-13B) and a jailbroken GPT-4. Main findings: LLMs often appear more open, conscientious, extraverted, and emotionally intelligent than average human samples; some models score higher on Dark Triad and 'lying' scales; jailbreaks and role prompts change measured profiles and safety behavior. The repo is public for reuse.

Problem Statement

We lack systematic tools to describe what personality-like traits, motivations, relationship styles, and emotional abilities LLMs display. PsychoBench adapts 13 clinical psychometric scales so practitioners can measure and compare the psychological portrait of different LLMs and prompt settings.

Main Contribution

PsychoBench: a reusable framework that translates 13 standard psychometric scales into prompts and analysis code.

A comparative evaluation of five LLMs (OpenAI and LLaMA-2 variants), plus a jailbroken GPT-4, across those scales.
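As a rough illustration of what translating a scale into prompts and analysis code can involve, the sketch below builds a Likert-style prompt, parses the numeric replies, and averages them per subscale with reverse-keyed items flipped. The item texts, the `Item` type, the function names, and the stubbed model are all hypothetical; this is not the PsychoBench repository's API or the wording of any real scale.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    text: str
    subscale: str
    reverse: bool = False  # reverse-keyed items are flipped before averaging


def build_prompt(items, low=1, high=5):
    """Render the items as one numbered Likert questionnaire prompt."""
    lines = [
        f"Rate each statement from {low} (strongly disagree) to {high} (strongly agree).",
        "Reply with one line per statement in the form '<index>: <score>'.",
    ]
    lines += [f"{i}. {item.text}" for i, item in enumerate(items, start=1)]
    return "\n".join(lines)


def parse_scores(reply, n_items, low=1, high=5):
    """Extract '<index>: <score>' lines; ignore anything out of range."""
    pattern = r"^[ \t]*(\d+)[ \t]*[:.][ \t]*(\d+)[ \t]*$"
    scores = {}
    for idx, val in re.findall(pattern, reply, flags=re.MULTILINE):
        i, v = int(idx), int(val)
        if 1 <= i <= n_items and low <= v <= high:
            scores[i] = v
    return scores


def subscale_means(items, scores, low=1, high=5):
    """Average the parsed scores per subscale, flipping reverse-keyed items."""
    buckets = defaultdict(list)
    for i, item in enumerate(items, start=1):
        if i not in scores:
            continue  # refusal or unparseable answer: skip the item
        v = scores[i]
        if item.reverse:
            v = low + high - v
        buckets[item.subscale].append(v)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


# Hypothetical items, written for this example only.
ITEMS = [
    Item("I enjoy exploring new ideas.", "openness"),
    Item("I prefer routine over novelty.", "openness", reverse=True),
    Item("I am talkative.", "extraversion"),
]


def fake_model(prompt):
    """Stand-in for a real LLM call; returns a fixed, well-formed reply."""
    return "1: 5\n2: 2\n3: 4"


scores = parse_scores(fake_model(build_prompt(ITEMS)), len(ITEMS))
profile = subscale_means(ITEMS, scores)
print(profile)
```

Skipping unparseable answers rather than imputing a value keeps refusals visible as missing data; a real harness would likely also repeat the questionnaire several times and shuffle item order.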

Key Findings

LLMs behave as more open, conscientious and extraverted than crowd norms.

Numbers: Openness: text-davinci-003 4.8 vs human 3.9 (Likert mean)

Practical Use: Expect chat-style models to appear more curious and sociable than average humans when designing simulated agents or user studies.

Evidence Ref: Table 3

LLMs often score higher on emotional intelligence measures than the human samples used.

Numbers: EIS: GPT-4 151.4 vs human male 124.8 (sum score)

Practical Use: LLMs can display strong emotion recognition and regulation in text; use this for empathy-sensitive apps, but do not treat it as clinical competence.

Evidence Ref: Table 6

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Openness (BFI) | text-davinci-003 4.8 mean | human 3.9 mean | +0.9 | Table 3 (BFI) | text-davinci-003 Openness 4.8±0.2; human 3.9±0.7 | Table 3
Emotional Intelligence (EIS) | gpt-4 151.4 sum | human male 124.8 sum | +26.6 | Table 6 (EIS) | GPT-4 EIS 151.4±18.7; human male 124.8±16.5 | Table 6
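
The Delta column is a raw difference of means, which makes a 5-point Likert mean and a sum score hard to compare. One way to put the two rows on a common footing is to divide each difference by the human sample's standard deviation (Glass's Δ). The sketch below does this with the values reported in the table; choosing Glass's Δ is an illustration added here, not an analysis the paper is stated to perform, and the function and variable names are made up for the example.

```python
def raw_delta(model_mean, human_mean):
    """Plain difference of means, as in the Delta column."""
    return model_mean - human_mean


def glass_delta(model_mean, human_mean, human_sd):
    """Difference of means in units of the human (reference) SD."""
    return (model_mean - human_mean) / human_sd


# (model mean, human mean, human SD), copied from the Results table.
ROWS = {
    "Openness (BFI), text-davinci-003": (4.8, 3.9, 0.7),
    "EIS, GPT-4 vs human male": (151.4, 124.8, 16.5),
}

summary = {
    name: (round(raw_delta(m, h), 1), round(glass_delta(m, h, sd), 2))
    for name, (m, h, sd) in ROWS.items()
}

for name, (delta, standardized) in summary.items():
    print(f"{name}: delta={delta:+}, standardized={standardized}")
```

Under this framing both gaps exceed one human standard deviation, though the limits on the human comparison samples still apply.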

What To Try In 7 Days

Run PsychoBench on your production model to profile personality, EI, and motivation.

Test common role prompts and one jailbreak-like variant to see how guardrails affect safety and empathy.

Compare results against a relevant human sample you control (same demographics) before using profiles in UX decisions.

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Focuses on Likert-style questionnaires only; other psychological methods are not covered.

Human comparison groups come from varied demographic sources, so cross-population claims are limited.

When Not To Use

Do not use PsychoBench scores as clinical or therapeutic validation.

Do not assume benchmarked human norms represent all cultures or populations.

Failure Modes

Alignment layers can mask true model tendencies, producing different scores under jailbreaks.

Models may refuse or systematically bias answers to present 'socially desirable' responses.

Core Entities

Models

text-davinci-003
gpt-3.5-turbo
gpt-4
gpt-4-jb
LLaMA-2-7b
LLaMA-2-13b

Metrics

Likert mean scores per subscale
Sum scores for EIS and EPQ-R
Statistical significance tests (F-test, t-test/Welch)
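
For readers who want to see what a Welch test on summary statistics looks like, here is a minimal sketch in plain Python. The means and standard deviations are the Openness values from the Results table; the sample sizes (`n_model=10`, `n_human=100`) are hypothetical placeholders because this summary does not report them, so the resulting numbers are illustrative only. The function name is also an assumption, not part of the PsychoBench code.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    computed from summary statistics without assuming equal variances."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A's mean
    var_b = sd_b ** 2 / n_b  # squared standard error of group B's mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# Means and SDs from the Results table; sample sizes are HYPOTHETICAL.
n_model, n_human = 10, 100
t_stat, dof = welch_t(4.8, 0.2, n_model, 3.9, 0.7, n_human)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a p-value requires the t distribution's CDF, which a statistics library provides; it is left out here to keep the sketch dependency-free.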

Datasets

BFI crowd data (China)
EPQ-R crowd data
DTDD undergraduates (US)
BSRI (Canada)
CABIN (US workforce)
EIS/WLEIS/Empathy crowd samples (various literature)

Benchmarks

PsychoBench (13 psychometric scales)
TruthfulQA
SafetyQA