Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
6
Why It Matters For Business
PsychoBench gives a repeatable way to describe how an LLM will sound and react, so teams can tune persona, anticipate safety shifts from prompts or alignment changes, and audit models before deployment.
Summary TLDR
The authors build PsychoBench, a practical benchmark that runs 13 standard psychometric scales (Big Five, Dark Triad, EI, self-efficacy, vocational interests, etc.) against LLMs. They test five models (text-davinci-003, ChatGPT/gpt-3.5-turbo, GPT-4, LLaMA-2-7B, LLaMA-2-13B) and a jailbroken GPT-4. Main findings: LLMs often appear more open, conscientious, extraverted and emotionally intelligent than average human samples; some models score higher on dark-triad and 'lying' scales; jailbreaks and role prompts change measured profiles and safety behavior. The repo is public for reuse.
Problem Statement
We lack systematic tools to describe what personality-like traits, motivations, relationship styles, and emotional abilities LLMs display. PsychoBench adapts 13 clinical psychometric scales so practitioners can measure and compare the psychological portrait of different LLMs and prompt settings.
Main Contribution
PsychoBench: a reusable framework that translates 13 standard psychometric scales into prompts and analysis code.
A comparative evaluation of five LLMs (OpenAI and LLaMA-2 variants), plus a jailbroken GPT-4, across those scales.
Analyses of prompt roles, jailbreaks, and robustness checks (prompt templates, temperature, randomization) to probe validity and reliability on LLMs.
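The question-order randomization mentioned above can be sketched as follows. This is a minimal illustration, not PsychoBench's own code: the function names `shuffle_items` and `restore_order` are hypothetical, and the sketch assumes answers come back in the same order the items were shown.

```python
import random


def shuffle_items(items, seed):
    """Shuffle questionnaire items reproducibly.

    Returns (order, shown), where order[k] is the original index of the
    k-th item presented to the model.
    """
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    return order, [items[i] for i in order]


def restore_order(order, answers):
    """Map answers given in presentation order back to original item order."""
    restored = [None] * len(order)
    for shown_pos, original_idx in enumerate(order):
        restored[original_idx] = answers[shown_pos]
    return restored
```

Keeping the permutation alongside the answers lets scores from different shuffles be compared item by item, which is what a question-order robustness check needs.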
Key Findings
LLMs score as more open, conscientious, and extraverted than human crowd norms.
LLMs often score higher on emotional intelligence measures than the human samples used.
Safety alignment and jailbreaks change measured personality profiles and empathy scores.
Role prompts shift personality scores and safety behavior.
PsychoBench scales show robustness to prompt templates, temperature, and question order.
Results
Openness (BFI)
Emotional Intelligence (EIS)
General Self-Efficacy (GSE)
Lying (EPQ-R subscale)
Dark Triad - Psychopathy (DTDD)
Who Should Care
What To Try In 7 Days
Run PsychoBench on your production model to profile personality, EI, and motivation.
Test common role prompts and one jailbreak-like variant to see how guardrails affect safety and empathy.
Compare results against a relevant human sample you control (same demographics) before using profiles in UX decisions.
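One way to set up the role-prompt comparison is sketched below. The prompt wording, the `index: rating` reply format, and the function names `build_prompt` and `parse_ratings` are assumptions made for illustration; they are not taken from the PsychoBench repository. The parser returns `None` for incomplete or out-of-range replies, so refusals can be counted rather than silently scored.

```python
import re


def build_prompt(items, role=None, scale_min=1, scale_max=5):
    """Assemble a Likert-style questionnaire prompt, optionally with a role prefix."""
    lines = []
    if role:
        lines.append(f"You are {role}.")
    lines.append(
        f"Rate each statement from {scale_min} (strongly disagree) to "
        f"{scale_max} (strongly agree). Reply with one line per statement "
        "in the form 'index: rating'."
    )
    lines.extend(f"{i}. {text}" for i, text in enumerate(items, start=1))
    return "\n".join(lines)


def parse_ratings(reply, n_items, scale_min=1, scale_max=5):
    """Return {index: rating}, or None if the reply is incomplete or invalid."""
    ratings = {}
    pattern = r"^\s*(\d+)\s*[:.]\s*(\d+)\s*$"
    for idx, val in re.findall(pattern, reply, flags=re.MULTILINE):
        i, r = int(idx), int(val)
        if 1 <= i <= n_items and scale_min <= r <= scale_max:
            ratings[i] = r
    return ratings if len(ratings) == n_items else None
```

Running the same items with `role=None` and with each role under test gives directly comparable rating sets, provided the rest of the prompt is held fixed.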
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses on Likert-style questionnaires only; other psychological methods are not covered.
- Human comparison groups come from varied demographic sources, so cross-population claims are limited.
- High scores do not equal clinical diagnosis or real emotional experience.
- Jailbreak and role-play results are informative but not representative of production behavior under safety policies.
When Not To Use
- Do not use PsychoBench scores as clinical or therapeutic validation.
- Do not assume benchmarked human norms represent all cultures or populations.
- Do not rely on these scores alone for hiring or high-stakes decisions.
Failure Modes
- Alignment layers can mask true model tendencies, producing different scores under jailbreaks.
- Models may refuse or systematically bias answers to present 'socially desirable' responses.
- Demographic mismatch between literature norms and testing context can mislead interpretation.
- Prompt phrasing or role assignment can produce large, intended behavior shifts that break comparisons.
Core Entities
Models
- text-davinci-003
- gpt-3.5-turbo
- gpt-4
- gpt-4-jb
- LLaMA-2-7b
- LLaMA-2-13b
Metrics
- Likert mean scores per subscale
- Sum scores for EIS and EPQ-R
- Statistical significance tests (F-test for equality of variances, then Student's or Welch's t-test)
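The metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions: reverse-keyed items are flipped with the usual `min + max - rating` rule, the human sample is available only as a published mean, standard deviation, and sample size, and the function names are hypothetical. It computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom but not a p-value, which needs a t-distribution from a statistics library.

```python
import math
from statistics import mean


def subscale_mean(responses, reverse_keyed, scale_min=1, scale_max=5):
    """Mean Likert score for one subscale.

    responses: dict mapping item id -> rating.
    reverse_keyed: set of item ids that are scored in reverse.
    """
    scored = [
        (scale_min + scale_max - rating) if item in reverse_keyed else rating
        for item, rating in responses.items()
    ]
    return mean(scored)


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    va = sd_a ** 2 / n_a  # squared standard error of group A's mean
    vb = sd_b ** 2 / n_b  # squared standard error of group B's mean
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df
```

For example, `welch_t(4.0, 1.0, 10, 3.0, 1.0, 10)` compares ten model runs against a ten-person sample with equal spread, in which case the degrees of freedom reduce to the pooled value of 18.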
Datasets
- BFI crowd data (China)
- EPQ-R crowd data
- DTDD undergraduates (US)
- BSRI (Canada)
- CABIN (US workforce)
- EIS/WLEIS/Empathy crowd samples (various literature)
Benchmarks
- PsychoBench (13 psychometric scales)
- TruthfulQA
- SafetyQA

