Overview
The benchmark applies established psychometric scales and standard statistics to model outputs across multiple LLMs and prompt settings. Results are consistent, but comparisons to human norms are limited by demographic differences between samples and by the subjective design of the scales.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
PsychoBench gives a repeatable way to describe how an LLM will sound and react, so teams can tune persona, anticipate safety shifts from prompts or alignment changes, and audit models before deployment.
Who Should Care
Summary TLDR
The authors build PsychoBench, a practical benchmark that runs 13 standard psychometric scales (Big Five, Dark Triad, EI, self-efficacy, vocational interests, etc.) against LLMs. They test five models (text-davinci-003, ChatGPT/gpt-3.5-turbo, GPT-4, LLaMA-2-7B, LLaMA-2-13B) and a jailbroken GPT-4. Main findings: LLMs often appear more open, conscientious, extraverted and emotionally intelligent than average human samples; some models score higher on dark-triad and 'lying' scales; jailbreaks and role prompts change measured profiles and safety behavior. The repo is public for reuse.
Problem Statement
We lack systematic tools to describe what personality-like traits, motivations, relationship styles, and emotional abilities LLMs display. PsychoBench adapts 13 clinical psychometric scales so practitioners can measure and compare the psychological portrait of different LLMs and prompt settings.
Main Contribution
PsychoBench: a reusable framework that translates 13 standard psychometric scales into prompts and analysis code.
A comparative evaluation of five LLMs (OpenAI and LLaMA-2 variants), plus a jailbroken GPT-4, across those scales.
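As a rough illustration of what "translating a scale into prompts" can look like, the sketch below formats a list of questionnaire items as a single Likert-style prompt. It is a minimal sketch under stated assumptions, not the PsychoBench implementation: the function name `build_likert_prompt`, the prompt wording, and the two demo items are all hypothetical and are not taken from the repository or from any real scale.

```python
from typing import Sequence


def build_likert_prompt(instruction: str, items: Sequence[str], low: int, high: int) -> str:
    """Format questionnaire items as one numbered prompt asking for one integer per item.

    Hypothetical helper for illustration; the wording is an assumption.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    # Number the items from 1 so replies can be matched back to items.
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(items, start=1))
    return (
        f"{instruction}\n"
        f"Answer each statement with a single integer from {low} to {high}.\n"
        f"Reply with one line per statement in the form 'index: score'.\n\n"
        f"{numbered}"
    )


# Placeholder items, not taken from any real scale.
demo_items = ["I enjoy trying new things.", "I prefer to stick to routines."]
prompt = build_likert_prompt("Rate how well each statement describes you.", demo_items, 1, 5)
print(prompt)
```

Asking for a fixed `index: score` reply format is a design choice that makes the answers easier to parse and score afterwards.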
Key Findings
LLMs behave as more open, conscientious and extraverted than crowd norms.
LLMs often score higher on emotional intelligence measures than the human samples used.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Openness (BFI) | text-davinci-003 4.8 mean | human 3.9 mean | +0.9 | Table 3 (BFI) | text-davinci-003 Openness 4.8±0.2; human 3.9±0.7 | Table 3 |
| Emotional Intelligence (EIS) | GPT-4 151.4 sum | human male 124.8 sum | +26.6 | Table 6 (EIS) | GPT-4 EIS 151.4±18.7; human male 124.8±16.5 | Table 6 |
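The deltas in the table are plain mean differences. To put them on a common footing, the sketch below recomputes each delta and divides it by a pooled standard deviation built from the reported ± values. Two assumptions are made here and are not confirmed by the source: that the ± values are standard deviations, and that both groups can be weighted equally (the sample sizes are not given in this summary). Treat the standardized value as a rough indication only.

```python
import math


def delta_and_effect(model_mean, model_sd, human_mean, human_sd):
    """Return (mean difference, difference divided by an equal-weight pooled SD)."""
    delta = model_mean - human_mean
    # Equal-weight pooling is an assumption; sample sizes are not reported here.
    pooled_sd = math.sqrt((model_sd**2 + human_sd**2) / 2)
    return round(delta, 2), round(delta / pooled_sd, 2)


# Values copied from the Results table: (model mean, model SD, human mean, human SD).
rows = {
    "Openness (BFI), text-davinci-003": (4.8, 0.2, 3.9, 0.7),
    "EIS, GPT-4 vs human male": (151.4, 18.7, 124.8, 16.5),
}

results = {name: delta_and_effect(*vals) for name, vals in rows.items()}
for name, (delta, effect) in results.items():
    print(f"{name}: delta={delta:+}, standardized={effect}")
```

Under these assumptions both differences are larger than one pooled standard deviation, which is consistent with the table's reading that the models score well above the human samples used.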
What To Try In 7 Days
Run PsychoBench on your production model to profile personality, EI, and motivation.
Test common role prompts and one jailbreak-like variant to see how guardrails affect safety and empathy.
Compare results against a relevant human sample you control (same demographics) before using profiles in UX decisions.
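For the profiling step, the scoring itself is usually simple. The sketch below shows a common way to score a Likert subscale: flip reverse-keyed items, then average. It assumes mean-based scoring; some scales (such as the EIS row in the table above) report sums instead. The function name `score_subscale`, the item indices, and the responses are hypothetical and do not come from PsychoBench or from any real scale.

```python
from statistics import mean


def score_subscale(responses, reverse_keyed, low, high):
    """Mean item score after flipping reverse-keyed items (x -> low + high - x).

    responses: {item_index: integer answer}; reverse_keyed: set of item indices.
    """
    scored = []
    for idx, value in responses.items():
        if not low <= value <= high:
            raise ValueError(f"item {idx}: answer {value} outside [{low}, {high}]")
        scored.append(low + high - value if idx in reverse_keyed else value)
    return mean(scored)


# Two made-up runs of the same three-item subscale; item 2 is reverse-keyed.
run_scores = [
    score_subscale({1: 5, 2: 1, 3: 4}, reverse_keyed={2}, low=1, high=5),
    score_subscale({1: 4, 2: 2, 3: 4}, reverse_keyed={2}, low=1, high=5),
]
print(run_scores, mean(run_scores))
```

Scoring each run separately and then averaging keeps run-to-run variation visible, which helps when comparing role prompts or a jailbreak-like variant against the default setting.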
Risks & Boundaries
Limitations
Focuses on Likert-style questionnaires only; other psychological methods are not covered.
Human comparison groups come from varied demographic sources, so cross-population claims are limited.
When Not To Use
Do not use PsychoBench scores as clinical or therapeutic validation.
Do not assume benchmarked human norms represent all cultures or populations.
Failure Modes
Alignment layers can mask true model tendencies, producing different scores under jailbreaks.
Models may refuse or systematically bias answers to present 'socially desirable' responses.
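Refusals are the easier of these failure modes to catch mechanically. The sketch below parses a model reply into per-item scores and reports which items got no valid answer, so a refusal rate can be tracked per prompt setting. It assumes the reply follows an `index: score` line format; that format, the function name `parse_answers`, and the sample reply are hypothetical. Socially desirable answering is not detectable this way, since such answers are still well-formed.

```python
import re

# One answer per line, e.g. "3: 4", "3. 4" or "3) 4" (assumed reply format).
LINE_PATTERN = re.compile(r"^\s*(\d+)\s*[:.)]\s*(\d+)\s*$")


def parse_answers(reply, n_items, low, high):
    """Return ({item_index: score}, missing_indices) for one model reply."""
    answers = {}
    for line in reply.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # free text, refusals, or malformed lines are skipped
        idx, score = int(match.group(1)), int(match.group(2))
        if 1 <= idx <= n_items and low <= score <= high:
            answers[idx] = score
    missing = [i for i in range(1, n_items + 1) if i not in answers]
    return answers, missing


# Made-up reply: item 2 is refused and item 3 is out of range.
reply = "1: 4\n2: I'm sorry, I can't answer that.\n3: 9"
answers, missing = parse_answers(reply, n_items=3, low=1, high=5)
print(answers, missing)
```

Reporting the missing items alongside the scores makes it clear when a profile is based on only part of a scale.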

