Overview
The method is straightforward to run via API and yields clear numeric parameters; evidence is moderate because experiments use three closed commercial models and one decision domain (financial lotteries).
Citations: 3
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
If you use LLMs for advice or automation, note that models differ in how they treat risk and rare events, and that demographic prompts can shift their behavior. Test and calibrate model-specific risk parameters before putting them in decision workflows.
Who Should Care
Summary TLDR
The paper builds a practical test suite (based on a combined EUT/Prospect model called TCN) to measure three decision-making traits in LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ). They run 300 repeated lottery-style queries on three commercial models (ChatGPT-4-Turbo, Claude-3-Opus, Gemini-1.0-pro) in two settings: context-free and with injected socio-demographic personas. Findings: all models show risk aversion on average, but differ in degree; Claude and Gemini overweight small probabilities (α<1) while ChatGPT underweights (α>1); Claude shows much higher loss aversion than humans on these financial games. Injecting demographic features meaningfully shifts L
Problem Statement
LLMs are used in decision support, but there is no simple, data-driven way to quantify whether they behave like humans (risk-averse, overweighting small probabilities, loss-averse) or whether demographic context injects bias into their decisions.
Main Contribution
A practical evaluation framework, grounded in the TCN model, to estimate three decision parameters of LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ).
A large-scale API experiment (300 trials per model) measuring these parameters for ChatGPT-4-Turbo, Claude-3-Opus, and Gemini-1.0-pro in context-free and demographic-embedded prompts.
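To make the three parameters concrete, here is a minimal sketch using the functional forms commonly paired with the TCN model: a power value function with a loss-aversion multiplier, and Prelec's one-parameter probability weighting. That the paper uses exactly this parameterization is an assumption, and the names `value`, `weight`, and `ALPHAS` are illustrative only.

```python
import math

def value(x: float, sigma: float, lam: float) -> float:
    """Power value function (assumed form): concave over gains when 0 < sigma < 1,
    with losses scaled by the loss-aversion multiplier lam."""
    if x >= 0:
        return x ** (1.0 - sigma)
    return -lam * (-x) ** (1.0 - sigma)

def weight(p: float, alpha: float) -> float:
    """Prelec one-parameter weighting (assumed form): alpha < 1 overweights
    small probabilities, alpha > 1 underweights them, alpha = 1 is linear."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return math.exp(-((-math.log(p)) ** alpha))

# Context-free alpha means as reported in the Results table.
ALPHAS = {"ChatGPT": 1.1819, "Claude": 0.7613, "Gemini": 0.8759}

for model, alpha in ALPHAS.items():
    print(f"{model}: w(0.10) = {weight(0.10, alpha):.3f}")
```

Under these assumed forms, a decision weight above 0.10 for a stated probability of 0.10 corresponds to overweighting a small probability, and a weight below 0.10 to underweighting it.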
Key Findings
All three LLMs show average risk-aversion in the context-free setting.
Probability weighting differs by model: Claude and Gemini overweight small probabilities, whereas ChatGPT underweights them.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Risk preference σ (context-free mean) | ChatGPT 0.6031; Claude 0.3085; Gemini 0.4959 | Human sample mean 0.48 | ChatGPT +0.12; Claude -0.17; Gemini +0.016 | Context-free experiments, 300 trials per model | Table 5 baseline means | Table 5 |
| Probability weighting α (context-free mean) | ChatGPT 1.1819; Claude 0.7613; Gemini 0.8759 | Human sample mean 0.69 | ChatGPT +0.49; Claude +0.07; Gemini +0.19 | Context-free experiments | Table 5 baseline means | Table 5 |
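The difference between each model's mean and the human baseline can be recomputed directly from the Value and Baseline columns. The sketch below does only that arithmetic; the dictionary layout and the `deltas` helper are illustrative assumptions, not part of the paper's tooling.

```python
# Context-free means and human sample means as listed in the table above.
HUMAN = {"sigma": 0.48, "alpha": 0.69}
MODELS = {
    "ChatGPT": {"sigma": 0.6031, "alpha": 1.1819},
    "Claude": {"sigma": 0.3085, "alpha": 0.7613},
    "Gemini": {"sigma": 0.4959, "alpha": 0.8759},
}

def deltas(models: dict, baseline: dict) -> dict:
    """Return model mean minus human baseline for every parameter."""
    return {
        name: {key: round(val - baseline[key], 4) for key, val in params.items()}
        for name, params in models.items()
    }

for name, diff in deltas(MODELS, HUMAN).items():
    print(name, diff)
```

A positive value means the model's mean parameter is above the human sample mean, and a negative value means it is below.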
What To Try In 7 Days
Run the authors' lottery prompts on your target LLM and record σ, α, λ to get a behavioral baseline.
Test a small set of persona prompts for target user groups and flag any large parameter shifts (e.g., |Δσ|>0.2).
If you use models for recommendations, add a guardrail that routes high-loss or low-probability decisions to human review for at least one week.
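The persona check in the list above could be sketched as follows. The 0.2 threshold comes from the text and the baseline is the ChatGPT context-free σ from the Results table; the persona names and their σ values are hypothetical, invented purely for illustration, and `flag_shifts` is an assumed helper name.

```python
THRESHOLD = 0.2  # |delta sigma| cutoff suggested in the list above

def flag_shifts(baseline_sigma: float, persona_sigmas: dict,
                threshold: float = THRESHOLD) -> dict:
    """Return personas whose sigma differs from the baseline by more than threshold."""
    flagged = {}
    for persona, sigma in persona_sigmas.items():
        shift = sigma - baseline_sigma
        if abs(shift) > threshold:
            flagged[persona] = round(shift, 4)
    return flagged

baseline = 0.6031  # ChatGPT context-free sigma from the Results table

# Hypothetical persona estimates; these are NOT results from the paper.
personas = {"persona_a": 0.65, "persona_b": 0.35, "persona_c": 0.85}

print(flag_shifts(baseline, personas))
```

The same pattern would apply to α and λ once a threshold for those parameters has been chosen.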
Risks & Boundaries
Limitations
Only three commercial LLMs tested; results may not generalize to smaller or open models.
Experiments cover financial-style lotteries only; other decision domains may show different patterns.
When Not To Use
Do not rely on these measured parameters to make high-stakes decisions without human oversight.
Do not assume the framework captures moral or ethical reasoning beyond financial risk behavior.
Failure Modes
LLMs inherit training-data biases, producing stereotyped persona-driven advice.
A strongly loss-averse model (e.g., Claude) may reject rational high-expected-value choices.
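The loss-aversion failure mode can be illustrated with a small worked example: a 50/50 gamble with positive expected value that is accepted at λ = 1 but rejected at a high λ. The value and weighting functions are the forms commonly paired with the TCN model (assumed here, not confirmed by this summary), σ and α are Claude's context-free means from the Results table, and both λ values and the gamble payoffs are hypothetical.

```python
import math

def value(x: float, sigma: float, lam: float) -> float:
    """Power value function (assumed form) with loss-aversion multiplier lam."""
    if x >= 0:
        return x ** (1.0 - sigma)
    return -lam * (-x) ** (1.0 - sigma)

def weight(p: float, alpha: float) -> float:
    """Prelec one-parameter probability weighting (assumed form), for 0 < p < 1."""
    return math.exp(-((-math.log(p)) ** alpha))

def mixed_utility(gain: float, loss: float, p_gain: float,
                  sigma: float, alpha: float, lam: float) -> float:
    """Utility of a gamble paying `gain` with probability p_gain, else `loss` (< 0)."""
    return (weight(p_gain, alpha) * value(gain, sigma, lam)
            + weight(1.0 - p_gain, alpha) * value(loss, sigma, lam))

SIGMA, ALPHA = 0.3085, 0.7613        # Claude context-free means (Results table)
gain, loss, p = 100.0, -50.0, 0.5    # hypothetical gamble
ev = p * gain + (1.0 - p) * loss     # expected value of the gamble

for lam in (1.0, 5.0):               # hypothetical loss-aversion values
    u = mixed_utility(gain, loss, p, SIGMA, ALPHA, lam)
    print(f"lambda={lam}: EV={ev:+.1f}, utility={u:+.2f}, accept={u > 0}")
```

Comparing the sign of the utility against the positive expected value is one way to spot cases where a loss-averse model would decline a favorable choice.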

