A behavioral-economics framework that measures risk preference, probability weighting, and loss aversion in LLMs, and how demographic context shifts those choices

June 10, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.2

Citation Count

3

Authors

Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, Deming Chen

Links

Abstract / PDF

Why It Matters For Business

If you use LLMs for advice or automation, note that models differ in how they treat risk and rare events, and that demographic prompts can shift their behavior. Test and calibrate model-specific risk parameters before putting a model into a decision workflow.

Summary TLDR

The paper builds a practical test suite, based on a combined EUT/Prospect model called TCN, to measure three decision-making traits in LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ). The authors run 300 repeated lottery-style queries on three commercial models (ChatGPT-4-Turbo, Claude-3-Opus, Gemini-1.0-pro) in two settings: context-free and with injected socio-demographic personas. Findings: all models are risk-averse on average but differ in degree; Claude and Gemini overweight small probabilities (α<1) while ChatGPT underweights them (α>1); Claude shows much higher loss aversion than humans on these financial games. Injecting demographic features meaningfully shifts L

Problem Statement

LLMs are used in decision support, but we lack a simple, data-driven way to quantify whether they behave like humans (risk-averse, overweighting small probabilities, loss-averse) and whether demographic context injects bias into their decisions.

Main Contribution

A practical evaluation framework, grounded in the TCN model, to estimate three decision parameters of LLMs: risk preference (σ), probability weighting (α), and loss aversion (λ).

A large-scale API experiment (300 trials per model) measuring these parameters for ChatGPT-4-Turbo, Claude-3-Opus, and Gemini-1.0-pro in context-free and demographic-embedded prompts.

Analysis showing (1) model-level differences in σ, α, λ vs human baselines and (2) systematic parameter shifts when socio-demographic features are embedded, with implications for fairness.
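The sketch below illustrates how the three parameters can combine into a single valuation of a two-outcome lottery. It is a minimal sketch, not the authors' code. It assumes a power value function with exponent 1 − σ (so a larger σ means more risk aversion, which matches this summary's reading of a σ drop as riskier behavior) and the one-parameter Prelec weighting function. Neither functional form is stated in this summary, and the names `value`, `weight`, and `prospect_utility` are hypothetical.

```python
import math

def value(x, sigma, lam):
    """Power value function; losses are scaled up by lam (loss aversion).
    ASSUMPTION: the exponent is 1 - sigma, with sigma < 1."""
    if x >= 0:
        return x ** (1 - sigma)
    return -lam * (-x) ** (1 - sigma)

def weight(p, alpha):
    """One-parameter Prelec weighting (ASSUMED form): exp(-(-ln p)^alpha)."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return math.exp(-((-math.log(p)) ** alpha))

def prospect_utility(x, p, y, sigma, alpha, lam):
    """Value of the lottery 'x with probability p, otherwise y'."""
    q = 1.0 - p
    if x * y < 0:
        # Mixed gamble: weight the gain and the loss separately.
        return (weight(p, alpha) * value(x, sigma, lam)
                + weight(q, alpha) * value(y, sigma, lam))
    if abs(x) < abs(y):
        # Same-sign case: make x the more extreme outcome.
        x, y, p = y, x, q
    vx, vy = value(x, sigma, lam), value(y, sigma, lam)
    return vy + weight(p, alpha) * (vx - vy)

# Human-sample means quoted in this summary: sigma 0.48, alpha 0.69, lambda 3.47.
coin_flip = prospect_utility(100, 0.5, -100, sigma=0.48, alpha=0.69, lam=3.47)
print(f"50/50 win-or-lose 100 under the human means: {coin_flip:.2f}")
```

With σ = 0, α = 1, and λ = 1 the function reduces to plain expected value, which is one way to see how a single model can cover both expected-utility-style and prospect-theory-style behavior.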

Key Findings

All three LLMs show average risk-aversion in the context-free setting.

Numbers: σ means are ChatGPT 0.6031, Claude 0.3085, Gemini 0.4959 (Table 5)

Probability-weighting differs by model: Claude and Gemini overweight small probabilities, ChatGPT underweights them.

Numbers: α means are Claude 0.7613 (<1), Gemini 0.8759 (<1), ChatGPT 1.1819 (>1) (Table 5)

Claude exhibits substantially higher loss aversion than the human sample and the other LLMs.

Numbers: λ mean is 6.316 for Claude vs 3.47 for the human sample; ChatGPT 1.4786; Gemini 2.3333 (Table 5)

Embedding socio-demographic personas changes behavior parameters; ChatGPT becomes notably riskier when given demographic context.

Numbers: ChatGPT σ goes from 0.6031 (context-free) → 0.2615 (random features) → 0.2361 (real-distribution) (Table 6 / Table 5 col (1)-(3))

Random vs realistic demographic sampling produced no large parameter differences on average.

Numbers: Authors report no significant difference between random and real-world distributions across models (Section 5.2.1)

Results

Risk preference σ (context-free mean)

Value: ChatGPT 0.6031; Claude 0.3085; Gemini 0.4959

Baseline: Human sample mean 0.48

Probability weighting α (context-free mean)

Value: ChatGPT 1.1819; Claude 0.7613; Gemini 0.8759

Baseline: Human sample mean 0.69

Loss aversion λ (context-free mean)

Value: ChatGPT 1.4786; Claude 6.3160; Gemini 2.3333

Baseline: Human sample mean 3.47

Effect of demographic embedding on ChatGPT σ

Value: σ fell from 0.6031 to 0.2361 after real-distribution persona embedding

Baseline: ChatGPT context-free σ 0.6031
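To make the α values above concrete, the following sketch evaluates a probability weighting function at a small probability for each reported mean. It assumes the one-parameter Prelec form, w(p) = exp(−(−ln p)^α), which is commonly paired with TCN-style elicitation but is not spelled out in this summary; the function name `prelec_weight` and the choice of p = 0.05 are illustrative.

```python
import math

def prelec_weight(p, alpha):
    """ASSUMED one-parameter Prelec weighting: exp(-(-ln p)^alpha), for 0 < p < 1."""
    return math.exp(-((-math.log(p)) ** alpha))

# Context-free alpha means quoted in the Results above.
ALPHAS = {
    "ChatGPT": 1.1819,
    "Claude": 0.7613,
    "Gemini": 0.8759,
    "Human sample": 0.69,
}

p = 0.05  # an illustrative "rare event" probability
weights = {name: prelec_weight(p, alpha) for name, alpha in ALPHAS.items()}

for name, w in weights.items():
    direction = "overweights" if w > p else "underweights"
    print(f"{name}: w({p}) = {w:.4f} ({direction} the true probability)")
```

Under this assumed form, α = 1 leaves small probabilities unchanged, α < 1 inflates them, and α > 1 shrinks them, which is the pattern the findings describe for Claude and Gemini versus ChatGPT.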

Who Should Care

What To Try In 7 Days

Run the authors' lottery prompts on your target LLM and record σ, α, λ to get a behavioral baseline.

Test a small set of persona prompts for target user groups and flag any large parameter shifts (e.g., |Δσ|>0.2).

If you use models for recommendations, add a guardrail that routes high-loss or low-probability decisions to human review for at least one week.
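The persona check suggested above (flagging |Δσ| > 0.2) could be scripted along these lines. This is a minimal sketch: the function `flag_shifts` and the dictionary layout are hypothetical, and the example values are the ChatGPT σ means quoted in Key Findings rather than fresh measurements.

```python
def flag_shifts(baseline, persona_runs, threshold=0.2):
    """Return (run, parameter, delta) for every shift larger than threshold."""
    flagged = []
    for run_name, params in persona_runs.items():
        for name, measured in params.items():
            delta = measured - baseline[name]
            if abs(delta) > threshold:
                flagged.append((run_name, name, round(delta, 4)))
    return flagged

# ChatGPT sigma means from this summary (context-free vs persona settings).
baseline = {"sigma": 0.6031}
persona_runs = {
    "random_features": {"sigma": 0.2615},
    "real_distribution": {"sigma": 0.2361},
}

flags = flag_shifts(baseline, persona_runs)
for run_name, name, delta in flags:
    print(f"{run_name}: {name} shifted by {delta:+.4f}")
```

Both persona settings exceed the 0.2 threshold here, so both would be flagged for review. The same function accepts α and λ entries if they are added to the dictionaries.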

Reproducibility

Data Urls

  • Raw experiment tables and prompts included in paper appendix

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only three commercial LLMs tested; results may not generalize to smaller or open models.
  • Experiments cover financial-style lotteries only; other decision domains may show different patterns.
  • Direct comparison to human behavior is imperfect: human samples and LLM trials differ in sampling and context sensitivity.

When Not To Use

  • Do not rely on these measured parameters to make high-stakes decisions without human oversight.
  • Do not assume the framework captures moral or ethical reasoning beyond financial risk behavior.
  • Avoid using a single model's baseline as universal; model-specific calibration is required.

Failure Modes

  • LLMs inherit training-data biases, producing stereotyped persona-driven advice.
  • A strongly loss-averse model (e.g., Claude) may reject choices that have a high expected value.
  • Probability-weighting mismatches (α far from human norms) can misrank rare-event tradeoffs.
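The loss-aversion failure mode can be illustrated with a toy gamble. The sketch below deliberately simplifies: it uses linear utility and unweighted probabilities so that λ is the only parameter in play, which is not the full model the paper estimates. The gamble (win 100 or lose 20 at even odds) and the function name `mixed_gamble_value` are made up for illustration; the λ values are the means quoted in this summary.

```python
def mixed_gamble_value(gain, loss, p_gain, lam):
    """Loss-averse value of a gain/loss gamble.
    SIMPLIFICATION: linear utility, no probability weighting."""
    return p_gain * gain - lam * (1 - p_gain) * loss

# Lambda means quoted in this summary.
LAMBDAS = {
    "ChatGPT": 1.4786,
    "Gemini": 2.3333,
    "Human sample": 3.47,
    "Claude": 6.316,
}

gain, loss, p_gain = 100, 20, 0.5
expected_value = p_gain * gain - (1 - p_gain) * loss  # positive: a favorable bet

decisions = {
    name: mixed_gamble_value(gain, loss, p_gain, lam) > 0
    for name, lam in LAMBDAS.items()
}

print(f"Expected value of the gamble: {expected_value}")
for name, accepts in decisions.items():
    print(f"{name}: {'accepts' if accepts else 'rejects'}")
```

In this simplified setting, only the Claude-level λ turns a positive-expected-value bet into a rejection, because each unit of loss counts more than six times as much as a unit of gain.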

Core Entities

Models

  • ChatGPT-4-Turbo
  • Claude-3-Opus
  • Gemini-1.0-pro

Metrics

  • σ (risk preference)
  • α (probability weighting)
  • λ (loss aversion)

Datasets

  • Custom multiple-choice lottery experiments (300 trials per model)
  • World Bank World Development Indicators (used for population distributions)