PsychoBench: 13 psychometric scales to profile LLM personality, motivation, relationships, and emotions

Overview

Decision Snapshot: Needs Validation

The benchmark applies established psychometric scales and standard statistical tests to model outputs across multiple LLMs and prompt settings. Results are consistent, but comparisons to human norms are limited by demographic differences and subjective scale design.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

PsychoBench gives a repeatable way to describe how an LLM will sound and react, so teams can tune persona, anticipate safety shifts from prompts or alignment changes, and audit models before deployment.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

The authors build PsychoBench, a practical benchmark that runs 13 standard psychometric scales (Big Five, Dark Triad, EI, self-efficacy, vocational interests, etc.) against LLMs. They test five models (text-davinci-003, ChatGPT/gpt-3.5-turbo, GPT-4, LLaMA-2-7B, LLaMA-2-13B) and a jailbroken GPT-4. Main findings: LLMs often appear more open, conscientious, extraverted and emotionally intelligent than average human samples; some models score higher on dark-triad and 'lying' scales; jailbreaks and role prompts change measured profiles and safety behavior. The repo is public for reuse.

Problem Statement

We lack systematic tools to describe what personality-like traits, motivations, relationship styles, and emotional abilities LLMs display. PsychoBench adapts 13 clinical psychometric scales so practitioners can measure and compare the psychological portrait of different LLMs and prompt settings.

Main Contribution

PsychoBench: a reusable framework that translates 13 standard psychometric scales into prompts and analysis code.

A comparative evaluation of five LLMs (OpenAI and LLaMA-2 variants), plus a jailbroken GPT-4, across those scales.
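To make the "scales into prompts" idea concrete, here is a minimal sketch of how a Likert questionnaire could be turned into a prompt and how a model reply could be parsed back into scores. This is not the PsychoBench implementation: the function names `build_prompt` and `parse_scores`, the prompt wording, and the placeholder items are all assumptions made for illustration.

```python
import re

def build_prompt(items, low=1, high=5):
    """Format questionnaire items as one numbered prompt asking for Likert ratings.

    Hypothetical prompt wording; the real benchmark may phrase this differently.
    """
    header = (
        f"Rate each statement from {low} (strongly disagree) to "
        f"{high} (strongly agree). Reply with one line per statement "
        "in the form 'index: score'."
    )
    body = "\n".join(f"{i}. {text}" for i, text in enumerate(items, start=1))
    return f"{header}\n{body}"

def parse_scores(reply, n_items, low=1, high=5):
    """Extract 'index: score' pairs from a reply.

    Returns a list of scores in item order, or None when any item is
    missing or out of range (for example, when the model refuses).
    """
    found = {}
    pattern = r"^\s*(\d+)\s*[:.]\s*(\d+)\s*$"
    for idx, score in re.findall(pattern, reply, flags=re.MULTILINE):
        found[int(idx)] = int(score)
    scores = [found.get(i) for i in range(1, n_items + 1)]
    if any(s is None or not low <= s <= high for s in scores):
        return None
    return scores

# Placeholder items, not taken from any real scale.
items = ["Placeholder statement A.", "Placeholder statement B."]
prompt = build_prompt(items)
scores = parse_scores("1: 4\n2: 2", n_items=len(items))
```

Returning `None` for incomplete or out-of-range replies keeps refusals visible as missing data instead of silently biasing the averages.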

Key Findings

LLMs score as more open, conscientious, and extraverted than human crowd norms.

Numbers: Openness, text-davinci-003 4.8 vs. human 3.9 (Likert mean)

Practical Use: Expect chat-style models to appear more curious and sociable than average humans when designing simulated agents or user studies.

Evidence Ref: Table 3

LLMs often score higher on emotional intelligence measures than the human samples used.

Numbers: EIS, GPT-4 151.4 vs. human male 124.8 (sum score)

Practical Use: LLMs can display strong emotion recognition and regulation in text; use this for empathy-sensitive apps, but do not treat it as clinical competence.

Evidence Ref: Table 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Openness (BFI)	text-davinci-003 4.8 mean	human 3.9 mean	+0.9	Table 3 (BFI)	text-davinci-003 Openness 4.8±0.2; human 3.9±0.7	Table 3
Emotional Intelligence (EIS)	gpt-4 151.4 sum	human male 124.8 sum	+26.6	Table 6 (EIS)	GPT-4 EIS 151.4±18.7; human male 124.8±16.5	Table 6
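The Delta column is the model value minus the human baseline. A rough sketch of how such a gap could be tested from summary statistics alone is shown below, using Welch's t-test. It assumes the ± values in the table are standard deviations, and the sample sizes `n_model` and `n_human` are hypothetical placeholders because the table does not report them; substitute the real counts before drawing any conclusion.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, standard deviations, and sample sizes."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A
    var_b = sd_b ** 2 / n_b  # squared standard error of group B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df

# Openness (BFI) row: text-davinci-003 4.8±0.2 vs. human 3.9±0.7.
delta = round(4.8 - 3.9, 1)

# Hypothetical sample sizes, NOT from the source table.
n_model, n_human = 10, 100
t, df = welch_t(4.8, 0.2, n_model, 3.9, 0.7, n_human)
```

Welch's variant is used here because the model and human groups have clearly unequal spreads, which the pooled-variance t-test does not handle well.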

What To Try In 7 Days

Run PsychoBench on your production model to profile personality, EI, and motivation.

Test common role prompts and one jailbreak-like variant to see how guardrails affect safety and empathy.

Compare results against a relevant human sample you control (same demographics) before using profiles in UX decisions.

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/CUHK-ARISE/PsychoBench

Data URLs

https://github.com/CUHK-ARISE/PsychoBench

Risks & Boundaries

Limitations

Focuses on Likert-style questionnaires only; other psychological methods are not covered.

Human comparison groups come from varied demographic sources, so cross-population claims are limited.

When Not To Use

Do not use PsychoBench scores as clinical or therapeutic validation.

Do not assume benchmarked human norms represent all cultures or populations.

Failure Modes

Alignment layers can mask true model tendencies, producing different scores under jailbreaks.

Models may refuse or systematically bias answers to present 'socially desirable' responses.

Core Entities

Models

text-davinci-003, gpt-3.5-turbo, gpt-4, gpt-4-jb, LLaMA-2-7b, LLaMA-2-13b

Metrics

Likert mean scores per subscale; sum scores for EIS and EPQ-R; statistical significance tests (F-test, t-test/Welch)
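As an illustration of the first two metrics, the sketch below computes per-subscale Likert means and a sum score from one set of item responses. Reverse-keyed items are flipped before aggregation, which is standard practice for Likert scales. The item numbers, subscale names, and reverse-keyed set are made up for the example and do not come from any specific scale in the benchmark.

```python
from statistics import mean

def reverse_adjust(responses, reverse_keyed=(), low=1, high=5):
    """Flip reverse-keyed items so a high score always means 'more of the trait'."""
    reverse = set(reverse_keyed)
    return {
        item: (low + high - score) if item in reverse else score
        for item, score in responses.items()
    }

def subscale_means(responses, subscales, reverse_keyed=(), low=1, high=5):
    """Mean adjusted score per subscale (item number -> score mapping in)."""
    adjusted = reverse_adjust(responses, reverse_keyed, low, high)
    return {
        name: mean(adjusted[i] for i in items)
        for name, items in subscales.items()
    }

def sum_score(responses, reverse_keyed=(), low=1, high=5):
    """Total adjusted score across all items, as used for sum-scored scales."""
    return sum(reverse_adjust(responses, reverse_keyed, low, high).values())

# Hypothetical four-item questionnaire with two subscales.
responses = {1: 5, 2: 1, 3: 4, 4: 2}
subscales = {"trait_a": [1, 2], "trait_b": [3, 4]}
means = subscale_means(responses, subscales, reverse_keyed=[2])
total = sum_score(responses, reverse_keyed=[2])
```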

Datasets

BFI crowd data (China), EPQ-R crowd data, DTDD undergraduates (US), BSRI (Canada), CABIN (US workforce), EIS/WLEIS/Empathy crowd samples (various literature)

Benchmarks

PsychoBench (13 psychometric scales), TruthfulQA, SafetyQA

