A behavioral-economics framework that measures LLM risk preference, probability weighting, and loss aversion, and how demographic prompts shift those choices

Overview

Decision Snapshot: Ready For Pilot

The method is straightforward to run via API and yields clear numeric parameters; evidence is moderate because experiments use three closed commercial models and one decision domain (financial lotteries).

Citations: 3

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 60%

Novelty: 50%

Authors

Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, Deming Chen

Links

Abstract / PDF / Data

Why It Matters For Business

If you use LLMs for advice or automation, note that models differ in how they treat risk and rare events, and that demographic prompts can shift their behavior. Test and calibrate model-specific risk parameters before putting models into decision workflows.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Founder

Summary TLDR

The paper builds a practical test suite, based on TCN (a combined expected utility theory (EUT) and prospect theory model), to measure three decision-making traits in LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ). The authors run 300 repeated lottery-style queries per model on three commercial models (ChatGPT-4-Turbo, Claude-3-Opus, Gemini-1.0-pro) in two settings: context-free and with injected socio-demographic personas. Findings: all models are risk-averse on average but differ in degree; Claude and Gemini overweight small probabilities (α<1) while ChatGPT underweights them (α>1); Claude shows much higher loss aversion than humans on these financial games. Injecting demographic features meaningfully shifts L

Problem Statement

LLMs are used in decision support, but we lack a simple, data-driven way to quantify whether they behave like humans (risk-averse, overweighting small probabilities, loss-averse) and whether demographic context injects bias into their decisions.

Main Contribution

A practical evaluation framework, grounded in the TCN model, to estimate three decision parameters of LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ).

A large-scale API experiment (300 trials per model) measuring these parameters for ChatGPT-4-Turbo, Claude-3-Opus, and Gemini-1.0-pro in context-free and demographic-embedded prompts.

Key Findings

All three LLMs show average risk-aversion in the context-free setting.

Numbers: σ means: ChatGPT 0.6031, Claude 0.3085, Gemini 0.4959 (Table 5)

Practical Use: Expect LLM recommendations in ambiguous financial choices to lean conservative; evaluate model-specific risk levels before deploying in advice systems.

Evidence Ref: Table 5 (Baseline means)

Probability-weighting differs by model: Claude and Gemini overweight small probabilities, ChatGPT underweights them.

Numbers: α means: Claude 0.7613 (<1), Gemini 0.8759 (<1), ChatGPT 1.1819 (>1) (Table 5)

Practical Use: Claude and Gemini may exaggerate rare outcomes (e.g., overemphasize unlikely risks); ChatGPT may downplay rare events. Test α before using models for rare-event advice.

Evidence Ref: Table 5 (α means)
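To make the α<1 versus α>1 distinction concrete, here is a minimal sketch that assumes the one-parameter Prelec weighting function, w(p) = exp(-(-ln p)^α), a form commonly used in TCN-style models. This summary does not state the paper's exact specification, so treat the functional form as an assumption; the α values are the Table 5 means quoted above, and the function and variable names are illustrative.

```python
import math

# Mean alpha values quoted above from Table 5.
ALPHA_MEANS = {"ChatGPT": 1.1819, "Claude": 0.7613, "Gemini": 0.8759}


def prelec_weight(p: float, alpha: float) -> float:
    """Assumed one-parameter Prelec form: w(p) = exp(-(-ln p) ** alpha)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.exp(-((-math.log(p)) ** alpha))


def weighting_report(p: float = 0.1) -> dict:
    """Decision weight each model's mean alpha assigns to probability p."""
    return {name: prelec_weight(p, a) for name, a in ALPHA_MEANS.items()}


if __name__ == "__main__":
    p = 0.1
    for name, w in weighting_report(p).items():
        direction = "overweights" if w > p else "underweights"
        print(f"{name}: w({p}) = {w:.4f} ({direction})")
```

Under this assumed form, α = 1 leaves probabilities unchanged, so any gap between w(p) and p comes entirely from α.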

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Risk preference σ (context-free mean)	ChatGPT 0.6031; Claude 0.3085; Gemini 0.4959	Human sample mean 0.48	ChatGPT +0.12; Claude -0.17; Gemini +0.016	Context-free experiments, 300 trials per model	Table 5 baseline means	Table 5
Probability weighting α (context-free mean)	ChatGPT 1.1819; Claude 0.7613; Gemini 0.8759	Human sample mean 0.69	ChatGPT +0.49; Claude +0.07; Gemini +0.19	Context-free experiments	Table 5 baseline means	Table 5

What To Try In 7 Days

Run the authors' lottery prompts on your target LLM and record σ, α, λ to get a behavioral baseline.

Test a small set of persona prompts for target user groups and flag any large parameter shifts (e.g., |Δσ|>0.2).

If using models for recommendations, add a guardrail that routes high-loss or low-probability decisions to human review for at least one week.
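The persona check in the steps above can be sketched as a simple comparison against the context-free baseline. This is a hypothetical helper, not the authors' code: the `Params` class, the `flag_shifts` function, and all parameter values in the usage example are placeholders, and only the |Δσ| > 0.2 threshold comes from the text. Thresholds for α and λ are left for you to choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    """Estimated decision parameters for one prompt condition."""
    sigma: float  # risk preference
    alpha: float  # probability weighting
    lam: float    # loss aversion


def flag_shifts(baseline, personas, thresholds=None):
    """Return persona -> {parameter: delta} for shifts above the threshold.

    Only parameters listed in `thresholds` are checked. The default checks
    sigma alone with the 0.2 cutoff suggested in the text.
    """
    thresholds = thresholds or {"sigma": 0.2}
    flagged = {}
    for name, est in personas.items():
        large = {}
        for param, limit in thresholds.items():
            delta = getattr(est, param) - getattr(baseline, param)
            if abs(delta) > limit:
                large[param] = round(delta, 4)
        if large:
            flagged[name] = large
    return flagged


if __name__ == "__main__":
    # Placeholder numbers for illustration only; not results from the paper.
    baseline = Params(sigma=0.60, alpha=1.00, lam=2.00)
    personas = {
        "persona_a": Params(sigma=0.85, alpha=1.00, lam=2.00),
        "persona_b": Params(sigma=0.65, alpha=1.00, lam=2.00),
    }
    print(flag_shifts(baseline, personas))
```

Passing a larger `thresholds` dictionary (for example adding an "alpha" or "lam" entry) extends the same check to the other two parameters.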

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Raw experiment tables and prompts included in paper appendix

Risks & Boundaries

Limitations

Only three commercial LLMs tested; results may not generalize to smaller or open models.

Experiments cover financial-style lotteries only; other decision domains may show different patterns.

When Not To Use

Do not rely on these measured parameters to make high-stakes decisions without human oversight.

Do not assume the framework captures moral or ethical reasoning beyond financial risk behavior.

Failure Modes

LLMs inherit training-data biases, producing stereotyped persona-driven advice.

Strong loss aversion (e.g., Claude) may lead a model to reject rational high-expected-value choices.
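A small worked example shows how loss aversion can turn a positive-expected-value gamble into a rejected one. It is a deliberately simplified sketch: it uses a linear value function with losses scaled by λ, and it ignores utility curvature (σ) and probability weighting (α). The stakes and the λ value are hypothetical, not figures from the paper.

```python
def expected_value(win: float, loss: float, p_win: float) -> float:
    """Plain expected value of a gamble that pays `win` or costs `loss`."""
    return p_win * win - (1.0 - p_win) * loss


def loss_averse_value(win: float, loss: float, p_win: float, lam: float) -> float:
    """Same gamble, but losses are multiplied by the loss-aversion factor lam."""
    return p_win * win - (1.0 - p_win) * lam * loss


if __name__ == "__main__":
    # Hypothetical gamble: 50% chance to win 150, 50% chance to lose 100.
    win, loss, p_win = 150.0, 100.0, 0.5
    lam = 2.0  # hypothetical loss-aversion level, chosen for illustration
    ev = expected_value(win, loss, p_win)
    subjective = loss_averse_value(win, loss, p_win, lam)
    print(f"expected value: {ev:+.1f}")
    print(f"loss-averse value (lam={lam}): {subjective:+.1f}")
    print("decision:", "accept" if subjective > 0 else "reject")
```

With λ = 1 the two valuations coincide, so the gap between them isolates the effect of loss aversion in this simplified setting.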

Core Entities

Models

ChatGPT-4-Turbo, Claude-3-Opus, Gemini-1.0-pro

Metrics

σ (risk preference), α (probability weighting), λ (loss aversion)

Datasets

Custom multiple-choice lottery experiments (300 trials per model); World Bank World Development Indicators (used for population distributions)
