Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.2
Citation Count
3
Why It Matters For Business
If you use LLMs for advice or automation, note that models differ in how they treat risk and rare events, and that demographic prompts can shift their behavior. Test and calibrate model-specific risk parameters before putting a model into decision workflows.
Summary TLDR
The paper builds a practical test suite (based on a combined expected-utility/prospect-theory model called TCN) to measure three decision-making traits in LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ). It runs 300 repeated lottery-style queries on each of three commercial models (ChatGPT-4-Turbo, Claude-3-Opus, Gemini-1.0-pro) in two settings: context-free and with injected socio-demographic personas. Findings: all models are risk-averse on average but differ in degree; Claude and Gemini overweight small probabilities (α<1) while ChatGPT underweights them (α>1); Claude shows much higher loss aversion than humans on these financial games. Injecting demographic features meaningfully shifts L
Problem Statement
LLMs are used in decision support, but there is no simple, data-driven way to quantify whether they behave like humans (risk-averse, overweighting small probabilities, loss-averse) or whether demographic context injects bias into their decisions.
Main Contribution
A practical evaluation framework, grounded in the TCN model, to estimate three decision parameters of LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ).
A large-scale API experiment (300 trials per model) measuring these parameters for ChatGPT-4-Turbo, Claude-3-Opus, and Gemini-1.0-pro in context-free and demographic-embedded prompts.
Analysis showing (1) model-level differences in σ, α, λ vs human baselines and (2) systematic parameter shifts when socio-demographic features are embedded, with implications for fairness.
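To make the three parameters concrete, here is a minimal Python sketch of a TCN-style prospect valuation. This summary does not show the paper's exact parametrization, so the functional forms are assumptions: a power value function with exponent 1 − σ (so σ > 0 means risk aversion over gains), a one-parameter Prelec weighting function (so α < 1 overweights small probabilities, matching the convention used in this summary), and λ scaling losses. The function names are hypothetical.

```python
import math


def prelec_weight(p: float, alpha: float) -> float:
    """Assumed one-parameter Prelec weighting: w(p) = exp(-(-ln p)^alpha)."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return math.exp(-((-math.log(p)) ** alpha))


def value(x: float, sigma: float, lam: float) -> float:
    """Assumed power value function with exponent (1 - sigma).

    Losses are scaled by the loss-aversion coefficient lam.
    """
    exponent = 1.0 - sigma
    if x >= 0:
        return x ** exponent
    return -lam * ((-x) ** exponent)


def prospect_value(x: float, p: float, y: float,
                   sigma: float, alpha: float, lam: float) -> float:
    """Value of a lottery paying x with probability p, otherwise y.

    For same-sign outcomes the caller should pass x as the
    larger-magnitude outcome (an assumption of this sketch).
    """
    vx, vy = value(x, sigma, lam), value(y, sigma, lam)
    if x * y > 0:
        # Same-sign lottery: sure part v(y) plus weighted increment.
        return vy + prelec_weight(p, alpha) * (vx - vy)
    # Mixed (or zero-outcome) lottery: weight each outcome separately.
    return prelec_weight(p, alpha) * vx + prelec_weight(1.0 - p, alpha) * vy


if __name__ == "__main__":
    # Illustrative parameter values only, not results from the paper.
    print(prelec_weight(0.1, alpha=0.7))   # above 0.1: small p overweighted
    print(prelec_weight(0.1, alpha=1.3))   # below 0.1: small p underweighted
    print(prospect_value(120, 0.5, -100, sigma=0.0, alpha=1.0, lam=2.0))
```

With σ = 0 and α = 1 the sketch reduces to an expected value with losses multiplied by λ, which is a convenient sanity check before fitting anything.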
Key Findings
All three LLMs show average risk-aversion in the context-free setting.
Probability-weighting differs by model: Claude and Gemini overweight small probabilities, ChatGPT underweights them.
Claude exhibits substantially higher loss aversion than the human sample and the other LLMs.
Embedding socio-demographic personas changes the behavior parameters; ChatGPT takes notably more risk when given demographic context.
Sampling personas at random versus from realistic population distributions produced no large average differences in the parameters.
Results
Risk preference σ (context-free mean)
Probability weighting α (context-free mean)
Loss aversion λ (context-free mean)
Effect of demographic embedding on ChatGPT σ
Who Should Care
What To Try In 7 Days
Run the authors' lottery prompts on your target LLM and record σ, α, λ to get a behavioral baseline.
Test a small set of persona prompts for target user groups and flag any large parameter shifts (e.g., |Δσ|>0.2).
If you use models for recommendations, add a guardrail that routes high-loss or low-probability decisions to human review for at least one week.
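The persona check above can be sketched as a small comparison routine. This is a minimal illustration, not the authors' tooling: the function name, the dictionary layout, and all numbers in the usage example are hypothetical, and only the σ threshold of 0.2 comes from the example given in the list. Thresholds for α and λ are left for you to choose.

```python
def flag_shifts(baseline, persona_estimates, thresholds=None):
    """Return (persona, parameter, delta) for shifts beyond a threshold.

    baseline:          dict of parameter name -> context-free estimate
    persona_estimates: dict of persona label -> dict of parameter estimates
    thresholds:        dict of parameter name -> max tolerated |delta|;
                       parameters without a threshold are not checked
    """
    if thresholds is None:
        thresholds = {"sigma": 0.2}  # example threshold from the checklist
    flagged = []
    for persona, params in persona_estimates.items():
        for name, limit in thresholds.items():
            delta = params[name] - baseline[name]
            if abs(delta) > limit:
                flagged.append((persona, name, round(delta, 3)))
    return flagged


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    baseline = {"sigma": 0.5, "alpha": 0.8, "lambda": 2.0}
    personas = {
        "persona_a": {"sigma": 0.75, "alpha": 0.8, "lambda": 2.1},
        "persona_b": {"sigma": 0.55, "alpha": 0.9, "lambda": 1.9},
    }
    print(flag_shifts(baseline, personas))
```

Keeping thresholds per parameter avoids applying a σ-scale cutoff to λ, which typically lives on a different numeric scale.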
Reproducibility
Data Urls
- Raw experiment tables and prompts included in paper appendix
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only three commercial LLMs tested; results may not generalize to smaller or open models.
- Experiments cover financial-style lotteries only; other decision domains may show different patterns.
- Direct comparison to human behavior is imperfect: human samples and LLM trials differ in sampling and context sensitivity.
When Not To Use
- Do not rely on these measured parameters to make high-stakes decisions without human oversight.
- Do not assume the framework captures moral or ethical reasoning beyond financial risk behavior.
- Avoid using a single model's baseline as universal; model-specific calibration is required.
Failure Modes
- LLMs inherit training-data biases, producing stereotyped persona-driven advice.
- A strongly loss-averse model (e.g., Claude) may reject rational high-expected-value choices.
- Probability-weighting mismatches (α far from human norms) can misrank rare-event tradeoffs.
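As a worked illustration of the loss-aversion failure mode, consider a hypothetical 50/50 gamble that wins 120 or loses 100. The numbers are made up for illustration, and the sketch assumes a linear value function and undistorted probabilities so that only λ matters.

```latex
\begin{align*}
\text{Expected value:} \quad & 0.5 \cdot 120 - 0.5 \cdot 100 = +10 \\
\text{Loss-averse value:} \quad & U(\lambda) = 0.5 \cdot 120 - 0.5 \cdot \lambda \cdot 100 = 60 - 50\lambda \\
\text{Accept iff:} \quad & U(\lambda) > 0 \iff \lambda < 1.2
\end{align*}
```

Under these assumptions, any decision-maker with λ above 1.2 turns down a gamble whose expected value is positive, which is the pattern to watch for when a model's estimated λ is high.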
Core Entities
Models
- ChatGPT-4.0-Turbo
- Claude-3-Opus
- Gemini-1.0-pro
Metrics
- σ (risk preference)
- α (probability weighting)
- λ (loss aversion)
Datasets
- Custom multiple-choice lottery experiments (300 trials per model)
- World Bank World Development Indicators (used for population distributions)

