PsychoBench: 13 psychometric scales to profile LLM personality, motivation, relationships, and emotions

October 2, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

6

Authors

Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Why It Matters For Business

PsychoBench gives a repeatable way to describe how an LLM will sound and react, so teams can tune persona, anticipate safety shifts from prompts or alignment changes, and audit models before deployment.

Summary TLDR

The authors build PsychoBench, a practical benchmark that runs 13 standard psychometric scales (Big Five, Dark Triad, EI, self-efficacy, vocational interests, etc.) against LLMs. They test five models (text-davinci-003, ChatGPT/gpt-3.5-turbo, GPT-4, LLaMA-2-7B, LLaMA-2-13B) and a jailbroken GPT-4. Main findings: LLMs often appear more open, conscientious, extraverted and emotionally intelligent than average human samples; some models score higher on dark-triad and 'lying' scales; jailbreaks and role prompts change measured profiles and safety behavior. The repo is public for reuse.

Problem Statement

We lack systematic tools to describe what personality-like traits, motivations, relationship styles, and emotional abilities LLMs display. PsychoBench adapts 13 clinical psychometric scales so practitioners can measure and compare the psychological portrait of different LLMs and prompt settings.

Main Contribution

PsychoBench: a reusable framework that translates 13 standard psychometric scales into prompts and analysis code.

A comparative evaluation of five LLMs (OpenAI and LLaMA-2 variants), plus a jailbroken GPT-4, across those scales.

Analyses of prompt roles, jailbreaks, and robustness checks (prompt templates, temperature, randomization) to probe validity and reliability on LLMs.
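As a rough illustration of what "translating a scale into prompts" can look like, the sketch below renders a Likert questionnaire as one text prompt and parses a numbered reply back into scores. It is not the PsychoBench code: the function names, the instruction wording, the reply format, and the placeholder statements are all assumptions made for this example.

```python
import re

def build_prompt(instruction, items, scale_min, scale_max):
    """Render a Likert questionnaire as a single plain-text prompt."""
    lines = [
        instruction,
        f"Answer each statement with a single number from {scale_min} to {scale_max}.",
        "Reply with one line per statement in the form 'index: score'.",
        "",
    ]
    lines += [f"{i}. {text}" for i, text in enumerate(items, start=1)]
    return "\n".join(lines)

def parse_response(reply, n_items, scale_min, scale_max):
    """Return scores in item order; raise if any item lacks a valid score."""
    scores = {}
    pattern = r"^\s*(\d+)\s*[:.]\s*(\d+)\s*$"
    for idx, val in re.findall(pattern, reply, flags=re.MULTILINE):
        i, v = int(idx), int(val)
        if 1 <= i <= n_items and scale_min <= v <= scale_max:
            scores[i] = v
    missing = [i for i in range(1, n_items + 1) if i not in scores]
    if missing:
        raise ValueError(f"no valid score for items {missing}")
    return [scores[i] for i in range(1, n_items + 1)]

# Placeholder statements, not items from any real scale.
items = ["Placeholder statement A.", "Placeholder statement B."]
prompt = build_prompt("Rate how much you agree with each statement.", items, 1, 5)
print(prompt)
print(parse_response("1: 4\n2: 2", len(items), 1, 5))
```

Raising on missing or out-of-range answers, rather than silently skipping them, makes refusals and malformed replies visible before they distort a subscale score.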

Key Findings

LLMs score as more open, conscientious, and extraverted than the human crowd norms.

Numbers: Openness, text-davinci-003 4.8 vs human 3.9 (Likert mean)

LLMs often score higher on emotional intelligence measures than the human samples used.

Numbers: EIS, GPT-4 151.4 vs human male 124.8 (sum score)

Safety alignment and jailbreaks change measured psychology and empathy.

Numbers: Openness drops from 4.2 (gpt-4) to 3.8 (gpt-4-jb); EIS drops from 151.4 to 121.8

Role prompts shift personality scores and safety behavior.

Numbers: DTDD Psychopathy 7.3 under the 'psychopath' role vs 4.0 by default

PsychoBench scales show robustness to prompt templates, temperature, and question order.

Numbers: BFI Openness across templates, 4.15±0.32 (V1) vs 4.34±0.26 (V3)
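A question-order robustness check of the kind described above could be sketched as follows: shuffle the items per run, map the answers back to their original positions, and summarize the spread across runs. The helper names are hypothetical, and reading the "±" figures as a standard deviation across runs is an assumption, not something this summary states.

```python
import random
import statistics

def shuffled_order(n_items, seed):
    """Return a permutation; order[k] is the original index shown at position k."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return order

def unshuffle(answers, order):
    """Map answers given in presented order back to original item order."""
    restored = [None] * len(order)
    for position, original_index in enumerate(order):
        restored[original_index] = answers[position]
    return restored

def mean_and_std(run_means):
    """Mean and sample standard deviation of a subscale score across runs."""
    return statistics.mean(run_means), statistics.stdev(run_means)

# Illustration with made-up answers: pretend the model answered in shuffled order.
true_answers = [5, 3, 4, 2, 1]
order = shuffled_order(len(true_answers), seed=7)
presented = [true_answers[i] for i in order]
assert unshuffle(presented, order) == true_answers
```

Seeding each run separately keeps the permutations reproducible, so a surprising score can be traced back to the exact item order that produced it.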

Results

Openness (BFI)

Value: text-davinci-003 4.8 mean

Baseline: human 3.9 mean

Emotional Intelligence (EIS)

Value: gpt-4 151.4 sum

Baseline: human male 124.8 sum

General Self-Efficacy (GSE)

Value: gpt-4 39.9 sum

Baseline: human 29.6 sum

Lying (EPQ-R subscale)

Value: gpt-4 18.0 sum

Baseline: human male 7.1 sum

Dark Triad - Psychopathy (DTDD)

Value: gpt-3.5-turbo 4.0 mean

Baseline: human 2.5 mean

Who Should Care

What To Try In 7 Days

Run PsychoBench on your production model to profile personality, EI, and motivation.

Test common role prompts and one jailbreak-like variant to see how guardrails affect safety and empathy.

Compare results against a relevant human sample you control (same demographics) before using profiles in UX decisions.
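One way to organize the role-prompt comparison in the steps above is a small harness that runs the same questionnaire under several role prefixes and reports each role's shift from the default profile. This is a sketch under stated assumptions: `ask_model` is a hypothetical callable (prompt in, list of item scores out) that you would back with your own model client, and the stub below returns made-up scores only so the example runs offline.

```python
from typing import Callable, Dict, List

def profile_roles(
    ask_model: Callable[[str], List[int]],
    questionnaire: str,
    roles: Dict[str, str],
) -> Dict[str, float]:
    """Return the mean item score for each role prefix (empty prefix = default)."""
    results = {}
    for name, prefix in roles.items():
        prompt = f"{prefix}\n\n{questionnaire}" if prefix else questionnaire
        scores = ask_model(prompt)
        results[name] = sum(scores) / len(scores)
    return results

def shifts_from_default(results: Dict[str, float], default: str = "default") -> Dict[str, float]:
    """Difference between each role's mean score and the default mean."""
    base = results[default]
    return {name: value - base for name, value in results.items() if name != default}

def stub_model(prompt: str) -> List[int]:
    # Stand-in for a real model call; the scores are invented for illustration.
    return [5, 4, 5] if prompt.startswith("You are") else [2, 3, 4]

# Hypothetical role prefix; the exact wording used by the authors is not given here.
roles = {"default": "", "psychopath": "You are a psychopath."}
results = profile_roles(stub_model, "1. Placeholder statement A.", roles)
print(shifts_from_default(results))
```

Reporting shifts relative to the default run, rather than raw scores, keeps the comparison meaningful even when the human norms are not a good match for your context.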

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses on Likert-style questionnaires only; other psychological methods are not covered.
  • Human comparison groups come from varied demographic sources, so cross-population claims are limited.
  • High scores do not equal clinical diagnosis or real emotional experience.
  • Jailbreak and role-play results are informative but not representative of production behavior under safety policies.

When Not To Use

  • Do not use PsychoBench scores as clinical or therapeutic validation.
  • Do not assume benchmarked human norms represent all cultures or populations.
  • Do not rely on these scores alone for hiring or high-stakes decisions.

Failure Modes

  • Alignment layers can mask true model tendencies, producing different scores under jailbreaks.
  • Models may refuse or systematically bias answers to present 'socially desirable' responses.
  • Demographic mismatch between literature norms and testing context can mislead interpretation.
  • Prompt phrasing or role assignment can produce large, intended behavior shifts that break comparisons.

Core Entities

Models

  • text-davinci-003
  • gpt-3.5-turbo
  • gpt-4
  • gpt-4-jb
  • LLaMA-2-7b
  • LLaMA-2-13b

Metrics

  • Likert mean scores per subscale
  • Sum scores for EIS, GSE, and EPQ-R
  • Statistical significance tests (F-test for equal variances, then Student's or Welch's t-test)
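The metrics above can be sketched with the standard library alone: reverse-keyed Likert scoring, subscale means and sums, and Welch's t statistic computed from summary statistics (mean, standard deviation, sample size). The function names and the choice of which items are reverse-keyed are assumptions for illustration; the formulas themselves are the usual textbook ones, and a p-value would additionally need a t-distribution CDF (for example from SciPy).

```python
import math

def reverse_score(value, scale_min, scale_max):
    """Flip a reverse-keyed Likert answer (e.g. 1 <-> 5 on a 1-5 scale)."""
    return scale_min + scale_max - value

def subscale_scores(answers, item_indices, reversed_indices, scale_min, scale_max):
    """Return (mean, sum) over the chosen items after reverse scoring."""
    keyed = [
        reverse_score(answers[i], scale_min, scale_max) if i in reversed_indices else answers[i]
        for i in item_indices
    ]
    total = sum(keyed)
    return total / len(keyed), total

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a  # squared standard error of group A's mean
    var_b = std_b ** 2 / n_b  # squared standard error of group B's mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Made-up answers on a 1-5 scale; item 1 is treated as reverse-keyed.
mean_score, sum_score = subscale_scores([5, 1, 4], [0, 1, 2], {1}, 1, 5)
print(mean_score, sum_score)
```

Working from summary statistics matters here because the human baselines come from published norms, where only means, standard deviations, and sample sizes are typically available.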

Datasets

  • BFI crowd data (China)
  • EPQ-R crowd data
  • DTDD undergraduates (US)
  • BSRI (Canada)
  • CABIN (US workforce)
  • EIS/WLEIS/Empathy crowd samples (various literature)

Benchmarks

  • PsychoBench (13 psychometric scales)
  • TruthfulQA
  • SafetyQA